Preface


The Computer Science Research Institute (CSRI) brings university faculty and students to
Sandia National Laboratories for focused collaborative research on computer science,
computational science, and mathematics problems that are critical to the mission of the
laboratories, the Department of Energy, and the United States. CSRI provides a mechanism by which
university researchers learn about and impact national- and global-scale problems while
simultaneously bringing new ideas from the academic research community to bear on these
important problems.

A key component of CSRI programs over the last decade has been an active and productive
summer program where students from around the country conduct internships at CSRI.
Each student is paired with a Sandia staff member who serves as technical advisor and mentor.
The goals of the summer program are to expose the students to research in mathematical
and computer sciences at Sandia and to conduct a meaningful and impactful summer research
project with their Sandia mentor. Every effort is made to align summer projects with the
student's research objectives, and all work is coordinated with the ongoing research activities
of the Sandia mentor in alignment with Sandia technical thrusts and the needs of the NNSA
Advanced Scientific Computing (ASC) program that has funded CSRI from its onset.

Since 2006, CSRI has encouraged all summer participants and their mentors to contribute
a technical article to the CSRI Summer Proceedings, of which this document is the
fifth installment. In many cases, the CSRI proceedings are the first opportunity that students
have to write a research article. Not only do these proceedings serve to document the
research conducted at CSRI but, as part of the research training goals of CSRI, it is the intent
that these articles serve as precursors to or first drafts of articles that could be submitted to
peer-reviewed journals. As such, each article has been reviewed by a Sandia staff member
knowledgeable in that technical area, with feedback provided to the authors. Several articles
have been, or are in the process of being, submitted to peer-reviewed conferences or journals, and
we anticipate that additional submissions will be forthcoming.

For the 2010 CSRI Proceedings, research articles have been organized into the following
broad technical focus areas: computational mathematics and algorithms, uncertainty
quantification and sensitivity analysis, meshing and optimization, computational applications,
architectures and networking, and visualization and software engineering. These areas are
well aligned with Sandia's strategic thrusts in computer and information sciences.

We would like to thank all participants who have contributed to the outstanding technical
accomplishments of CSRI in 2010 as documented by the high-quality articles in these proceedings.
The success of CSRI hinged on the hard work of 36 enthusiastic student collaborators
and their dedicated Sandia technical staff mentors. It is truly impressive that the research
described herein occurred primarily over a three-month period of intensive collaboration.

CSRI benefited from the administrative help of Dee Cadena, Deanna Ceballos, Denis
Laporte, Mel Loran, and Bernadette Watts. The success of CSRI is, in large part, due to their
dedication and care, which are much appreciated. We would also like to thank those who
reviewed articles for these proceedings; their feedback is an important part of the research
training process and has significantly improved the quality of the papers herein. Finally,
we want to acknowledge the ASC program for its continued support of the CSRI and its
activities, which have benefited both Sandia and the greater research community.

Eric  C.  Cyr 
S.  Scott  Collis 


December  17,  2010 
Computational Mathematics and Algorithms

Articles in this section focus on the development of numerical algorithms and novel
computational models. This includes discretization techniques, preconditioning methodologies, and
implementation of numerical algorithms on novel architectures.

Burch and Lehoucq discuss the nonlocal Cattaneo-Vernotte equation for anomalous diffusion,
introducing nonlocal boundary conditions. The effects of relaxation and nonlocality
are studied with a finite element formulation. Lai et al. present a new least squares finite
element formulation for the Stokes equations that has improved mass conservation. To achieve
this, the formulation utilizes a novel discontinuous approximation of the velocity. Results
demonstrating the usefulness of this approach are given. Roberts et al. develop a discontinuous
Petrov-Galerkin method for the Stokes equations. Their study and the numerical
experiments exposed several interesting avenues for future study. Phillips et al. investigate
block preconditioners for the unsteady Navier-Stokes equations. The authors perform numerical
experiments comparing several techniques for approximating the pressure Schur complement.
Ballard et al. present an efficient implementation of the shifted symmetric higher order
power method for computing tensor eigenvalues. They show that a 70x speedup over a serial
implementation can be achieved when implemented on a GPU.


E.C.  Cyr 
S.S.  Collis 


December  17,  2010 
THE NONLOCAL CATTANEO-VERNOTTE EQUATION

NATHANIAL  J.  BURCH*  AND  RICHARD  B.  LEHOUCQ1 

Abstract.  We  introduce  the  nonlocal  Cattaneo-Vernotte  equation  by  including  a  relaxation  effect  in  the  nonlocal 
diffusion  equation,  both  of  which  are  models  for  anomalous  diffusion.  This  equation  has  two  different  interpretations: 
as  a  generalization  of  Fick’s  first  law  in  terms  of  a  nonlocal  flux  and  memory  kernel  and  as  an  equation  arising  from 
the  generalized  master  equation  for  a  continuous  time  random  walk.  Both  interpretations  are  discussed  and  the  latter 
describes  a  scaling  of  relaxation  time  and  nonlocality.  A  relationship  to  fractional  diffusion  equations  in  the  limit 
of  vanishing  relaxation  time  and  nonlocality  is  also  established.  The  main  contribution  of  this  paper  is  to  introduce 
nonlocal  boundary  conditions  for  the  nonlocal  Cattaneo-Vernotte  equation,  and  the  ensuing  variational  and  finite 
element  formulations.  Examples  are  given  where  the  effects  of  relaxation  time  and  nonlocality  are  studied. 


1. Introduction. The classical Cattaneo-Vernotte equation

\[ w_t + \frac{\tau}{2} w_{tt} = a w_{xx}, \tag{1.1} \]

where $\tau/2 > 0$ is the relaxation time and $a > 0$ is the diffusion coefficient, is a model for
diffusion that admits finite speeds of propagation, specifically $\sqrt{2a/\tau}$. When $w$ is a
temperature field, (1.1) is a model of hyperbolic heat conduction [14]. Further, (1.1) arises from the
classical balance law, $w_t(x,t) = -q_x(x,t)$, and a generalization of Fick's first law in which the
flux is given by a convolution of the gradient of the field $w$ and a relaxation kernel [12],


\[ q(x,t) = -a \int_0^t \frac{2}{\tau} \exp\left(-\frac{t-t'}{\tau/2}\right) w_x(x,t')\,dt'. \tag{1.2} \]


The assumption (1.2) also takes the more familiar form of Cattaneo's equation [7],

\[ q + \frac{\tau}{2} q_t = -a w_x. \tag{1.3} \]
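As a brief check, (1.3) can be recovered from (1.2) with the Leibniz rule. The steps below are a sketch that uses only the symbols already introduced; they are not taken from the original text.

```latex
\begin{align*}
q_t(x,t)
  &= -a\,\frac{2}{\tau}\,w_x(x,t)
     \;-\; \frac{2}{\tau}\left(-a\int_0^t \frac{2}{\tau}
       \exp\!\left(-\frac{t-t'}{\tau/2}\right) w_x(x,t')\,dt'\right) \\
  &= -\frac{2a}{\tau}\,w_x(x,t) \;-\; \frac{2}{\tau}\,q(x,t)
  \quad\Longrightarrow\quad
  q + \frac{\tau}{2}\,q_t = -a\,w_x, \\[4pt]
w_t + \frac{\tau}{2}\,w_{tt}
  &= -\left(q + \frac{\tau}{2}\,q_t\right)_x = a\,w_{xx}
  \qquad\text{(using the balance law } w_t = -q_x\text{)}.
\end{align*}
```

The last line shows how (1.3) and the balance law together give (1.1).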

The classical diffusion equation

\[ w_t = a w_{xx} \tag{1.4} \]

yields an infinite speed of propagation because its fundamental solution, i.e., the solution to
(1.4) with an initial condition given by the Dirac delta function, is

\[ w(x,t) = \frac{1}{\sqrt{4\pi a t}} \exp\left(-\frac{x^2}{4at}\right), \]


which is positive for all $x$, for any arbitrarily small $t$. This property has been referred to in the
literature as "unphysical" since disturbances are instantaneously propagated. Moreover, (1.4)
is incapable of capturing transient dynamics of the field in situations involving short times,
high frequencies, and short wavelengths [14]. One approach to remedy these issues is to
introduce a relaxation time [12]; a special case of this is the classical Cattaneo-Vernotte
equation (1.1), which overcomes the unphysical properties associated with infinite speeds of
propagation present in (1.4).
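The contrast between the two models can be illustrated numerically. The following is a minimal sketch, assuming the forms of (1.1) and of the fundamental solution of (1.4) stated above; the function names `propagation_speed` and `heat_kernel` are hypothetical and chosen only for this illustration.

```python
import math


def propagation_speed(a: float, tau: float) -> float:
    """Speed sqrt(2a/tau) associated with w_t + (tau/2) w_tt = a w_xx."""
    if a <= 0.0 or tau <= 0.0:
        raise ValueError("a and tau must be positive")
    return math.sqrt(2.0 * a / tau)


def heat_kernel(x: float, t: float, a: float) -> float:
    """Fundamental solution of the classical diffusion equation w_t = a w_xx."""
    if t <= 0.0 or a <= 0.0:
        raise ValueError("t and a must be positive")
    return math.exp(-x * x / (4.0 * a * t)) / math.sqrt(4.0 * math.pi * a * t)


if __name__ == "__main__":
    a = 1.0
    # The speed grows without bound as the relaxation time vanishes.
    for tau in (1.0, 0.1, 0.01):
        print(f"tau = {tau:5.2f}  speed = {propagation_speed(a, tau):8.3f}")
    # The heat kernel is tiny, yet strictly positive, far from the source at a short time.
    print(heat_kernel(5.0, 0.1, a))
```

The loop shows the finite speed increasing as the relaxation time shrinks, which is the sense in which (1.4) is the infinite-speed limit of (1.1).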

The diffusion equation (1.4) may be derived by combining the classical balance law
$w_t = -q_x$ and Fick's first law

\[ q = -a w_x. \tag{1.5} \]
When the diffusion process is anomalous, e.g., does not obey Fick's first law (1.5), other
models have been proposed. Examples include the fractional diffusion equation

\[ v_t = -c\,(-\Delta)^{\alpha/2} v, \qquad 0 < \alpha \le 2, \tag{1.6} \]

which includes (1.4) as the special case $\alpha = 2$ and $c = a$, and the nonlocal diffusion equation

\[ u_t(x,t) = \frac{1}{\beta} \int_{\mathbb{R}} \bigl(u(y,t) - u(x,t)\bigr)\,\phi(x-y)\,dy. \tag{1.7} \]

The equation (1.7) has a probabilistic interpretation as a generalized master equation for a
continuous time random walk (CTRW). The rate of diffusion associated with $u(x,t)$ depends
upon points $y \ne x$, e.g., the rate of diffusion is the difference between the rate at which $u$ enters $x$
at time $t$, $\beta^{-1}\int_{\mathbb{R}} u(y,t)\,\phi(y-x)\,dy$, and the rate at which $u$ departs $x$ at time $t$,
$\beta^{-1}u(x,t)$. The mean wait-time between steps is $\beta$ and, given the radial probability density
$\phi$, the diffusion coefficient is given by

\[ \frac{1}{2\beta} \int_{\mathbb{R}} s^2 \phi(s)\,ds. \]

Like (1.4), both (1.6) and (1.7) give rise to infinite speeds of propagation. We note (1.7) has
been used as a model for peridynamic heat conduction [4] and variations of it have appeared
in numerous applications [2, 6, 10].

This paper focuses on the nonlocal Cattaneo-Vernotte equation

\[ u_t(x,t) + \frac{\tau}{2} u_{tt}(x,t) = \frac{1}{\beta} \int_{\mathbb{R}} \bigl(u(y,t) - u(x,t)\bigr)\,\phi(x-y)\,dy, \tag{1.8} \]

where $\phi$ is a radial probability density function, $\beta$ is the mean wait-time between steps, and
$\tau/2 > 0$ is again the relaxation time. In (1.8), the diffusion coefficient is

\[ \frac{1}{2\beta} \int_{\mathbb{R}} s^2 \phi(s)\,ds. \]

In the spirit of (1.7), (1.8) has replaced the second-order spatial derivative in (1.1) with the
nonlocal integral operator and, consequently, is a model for anomalous diffusion. With the
introduction of nonlocal boundary conditions, following [5], (1.8) becomes a model for nonlocal
hyperbolic heat conduction on bounded domains. The nonlocal Cattaneo-Vernotte equation
(1.8), like (1.7), has a probabilistic interpretation. Thus, (1.8) is a model for anomalous
diffusion that can be derived from a CTRW framework.

The contribution of this paper is to investigate the effect of a nonzero relaxation time by
comparing solutions of nonlocal boundary value problems corresponding to (1.7) and (1.8).
The former was studied extensively in [5]; here we use a conforming finite element method, which
depends on the variational framework presented in [11], to approximate solutions to these
nonlocal boundary value problems. We demonstrate a relationship between relaxation
time and nonlocality and show that the solutions of (1.8) converge to those of (1.6) as both relaxation
time and nonlocality vanish. Consequently, a relationship between (1.6)-(1.8) in the limit of
vanishing relaxation time and nonlocality is established.

The rest of this paper is organized as follows. Section 2 demonstrates how (1.8) arises
from a generalization of Fick's first law in which the flux is given by a convolution of a memory
kernel and a nonlocal spatial operator acting as the gradient of the field, in contrast to
(1.2). The relationships between (1.6)-(1.8) are also reviewed. Section 3 relates the nonlocal
Cattaneo-Vernotte equation to a continuous time random walk via the generalized master
equation. Nonlocal boundary conditions for (1.8) are reviewed in Section 4, as is the variational
formulation and ensuing finite element method. Section 5 provides numerical examples
to illustrate the effects of nonzero relaxation time and nonlocality.
2. Generalization of Fick's first law and a nonlocal flux. In this section, we demonstrate
that (1.8) arises via the classical balance law, $u_t = -q_x$, and a generalization of Fick's first
law,

\[ q(x,t) = \int_0^t \frac{2}{\tau} \exp\left(-\frac{t-t'}{\tau/2}\right) \frac{1}{\beta}\, p(x,t')\,dt', \tag{2.1} \]

where

\[ p(x,t) := -\frac{1}{2} \int_{\mathbb{R}} \int_0^1 \bigl(u(x + (1-\lambda)z, t) - u(x - \lambda z, t)\bigr)\, z\,\phi(z)\,d\lambda\,dz. \]

Differentiating (2.1) with respect to $t$ and rearranging reveals

\[ q + \frac{\tau}{2} q_t = \frac{1}{\beta}\, p, \tag{2.2} \]

and so

\[ u_t = -\frac{1}{\beta}\, p_x + \frac{\tau}{2} q_{xt} = -\frac{1}{\beta}\, p_x - \frac{\tau}{2} u_{tt}. \tag{2.3} \]

Noll's Lemma I [15, 18] implies that

\[ -\frac{1}{\beta}\, p_x(x,t') = \frac{1}{\beta} \int_{\mathbb{R}} \bigl(u(x+z,t') - u(x,t')\bigr)\,\phi(z)\,dz, \tag{2.4} \]

and so (1.8) is established.

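We now demonstrate a formal relationship between (1.8) and (1.1) in the presence of vanishing
nonlocality. Fix $\tau > 0$, let $\beta = \varepsilon^2$, where $\varepsilon > 0$, and define the symmetric probability
density

\[ \phi_\varepsilon(s) := \frac{1}{\varepsilon}\,\phi(s/\varepsilon), \tag{2.5} \]

where the given symmetric density $\phi$ satisfies¹

\[ \int_{\mathbb{R}} s^{2k} \phi(s)\,ds < \infty, \qquad k = 0, 1, 2, \ldots. \]

As $\varepsilon \to 0$, $\phi_\varepsilon(x-y)$ weights points near $x$ more heavily, relative to points further away.
Necessarily, by specifying the second moment appropriately, the Fourier transform of $\phi$ has
an expansion of the form

\[ \hat{\phi}(\xi) = 1 - a|\xi|^2 + o(|\xi|^2), \qquad a > 0. \]

With $\phi_\varepsilon$ in place of $\phi$ and assuming a formal Taylor expansion is valid for sufficiently small
$\varepsilon$,

\begin{align*}
\frac{1}{\beta}\, p(x,t)
  &= -\frac{1}{2\varepsilon^2} \sum_{k=1}^{\infty} \frac{\partial^k u(x,t)}{\partial x^k}\,\frac{1}{k!}
     \int_{\mathbb{R}} \int_0^1 \bigl((1-\lambda)^k - (-\lambda)^k\bigr)\, z^{k+1} \phi_\varepsilon(z)\,d\lambda\,dz \\
  &= -a u_x(x,t) - \frac{1}{\varepsilon^2} \sum_{k=2}^{\infty} \frac{\partial^{2k-1} u(x,t)}{\partial x^{2k-1}}\,
     \frac{1}{2k\,(2k-1)!} \int_{\mathbb{R}} z^{2k} \phi_\varepsilon(z)\,dz \\
  &= -a u_x(x,t) - \sum_{k=2}^{\infty} \frac{\partial^{2k-1} u(x,t)}{\partial x^{2k-1}}\,
     \frac{\varepsilon^{2(k-1)}}{(2k)!} \int_{\mathbb{R}} z^{2k} \phi(z)\,dz
\end{align*}

¹ The assumption of symmetry of $\phi$ implies that the odd moments are zero.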
and, utilizing (2.2), we obtain an approximation of (1.3),

\[ q + \frac{\tau}{2} q_t = -a u_x + O(\varepsilon^2). \tag{2.6} \]

Thus, in the absence of nonlocality, the nonlocal Cattaneo-Vernotte equation (1.8) reduces to
the classical Cattaneo-Vernotte equation (1.1). The effect of the density $\phi_\varepsilon$ with $\beta = \varepsilon^2$ as $\varepsilon$
decreases is to "localize" the diffusion of (1.8). Indeed, taking $\phi(x-y) = \delta(x-y) + a\delta''(x-y)$
in (1.8) recovers (1.1). Moreover, if $\tau = O(\varepsilon^2)$, (2.6) reduces to

\[ q = -a u_x + O(\varepsilon^2), \tag{2.7} \]

approximating (1.5). Thus, in the absence of both relaxation time and nonlocality, the nonlocal
Cattaneo-Vernotte equation (1.8) reduces to the classical diffusion equation (1.4).
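The localization argument can be checked numerically. The sketch below assumes a uniform density $\phi$ on $(-1,1)$, for which $a = \frac{1}{2}\int s^2\phi(s)\,ds = 1/6$, the scaling $\beta = \varepsilon^2$, and a midpoint quadrature rule; the helper names `phi` and `nonlocal_operator` are hypothetical. It compares the right-hand side of (1.8) applied to $u = \cos$ with $a\,u_{xx}$.

```python
import math


def phi(s: float) -> float:
    """Uniform density on (-1, 1); half of its second moment is a = 1/6."""
    return 0.5 if -1.0 < s < 1.0 else 0.0


def nonlocal_operator(u, x: float, eps: float, n: int = 2000) -> float:
    """Midpoint-rule value of (1/beta) * int (u(x+z) - u(x)) phi_eps(z) dz, beta = eps**2."""
    beta = eps ** 2
    h = 2.0 * eps / n  # phi_eps(z) = phi(z/eps)/eps is supported on (-eps, eps)
    total = 0.0
    for k in range(n):
        z = -eps + (k + 0.5) * h
        total += (u(x + z) - u(x)) * phi(z / eps) / eps * h
    return total / beta


if __name__ == "__main__":
    a, x = 1.0 / 6.0, 0.3
    target = -a * math.cos(x)  # a * u_xx for u = cos
    for eps in (0.4, 0.2, 0.1):
        approx = nonlocal_operator(math.cos, x, eps)
        print(f"eps = {eps:4.2f}  nonlocal = {approx:+.6f}  error = {abs(approx - target):.2e}")
```

Under these assumptions the error should shrink roughly like $\varepsilon^2$, in line with the $O(\varepsilon^2)$ remainder in (2.6).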

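Finally, we establish a relationship to the fractional diffusion equation (1.6). Suppose $\phi$
is a symmetric probability density function with the expansion

\[ \hat{\phi}(\xi) = 1 - c|\xi|^{\alpha} + o(|\xi|^{\alpha}), \qquad 0 < \alpha \le 2, \tag{2.8} \]

for $c > 0$, so that, defining $\phi_\varepsilon$ via (2.5),

\[ \hat{\phi}_\varepsilon(\xi) = 1 - c\,\varepsilon^{\alpha}|\xi|^{\alpha} + o(\varepsilon^{\alpha}|\xi|^{\alpha}). \]

Assuming $\beta = \varepsilon^{\alpha}$, the Fourier transform of (1.8) gives

\begin{align*}
\hat{u}_t(\xi,t) + \frac{\tau}{2}\,\hat{u}_{tt}(\xi,t)
  &= \frac{1}{\beta}\bigl(\hat{\phi}_\varepsilon(\xi) - 1\bigr)\hat{u}(\xi,t) \\
  &= \frac{1}{\varepsilon^{\alpha}}\bigl(-c\,\varepsilon^{\alpha}|\xi|^{\alpha} + o(\varepsilon^{\alpha}|\xi|^{\alpha})\bigr)\hat{u}(\xi,t) \\
  &= -c|\xi|^{\alpha}\,\hat{u}(\xi,t) + o(|\xi|^{\alpha}),
\end{align*}

implying that $u$, in a formal sense, is approximately given by the fractional Cattaneo-Vernotte
equation

\[ v_t(x,t) + \frac{\tau}{2}\, v_{tt}(x,t) = -c\,(-\Delta)^{\alpha/2} v(x,t). \tag{2.9} \]

Further, if $\tau = O(\varepsilon^{\alpha})$, $u$ is approximately given by the fractional diffusion equation (1.6),
where equality can be shown in special cases of this limit via characteristic function techniques.
Evidently, the nonlocal boundary value problems corresponding to (1.8) are then
related to those of the fractional Cattaneo-Vernotte equation (2.9) and fractional diffusion
equation (1.6) in this limit.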
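The limit of the Fourier symbol can be sketched numerically. The example assumes the particular choice $\hat\phi(\xi) = \exp(-c|\xi|^\alpha)$, which satisfies (2.8) and matches the stable densities used in Section 5, together with $\beta = \varepsilon^\alpha$; the function name `scaled_symbol` is hypothetical.

```python
import math


def scaled_symbol(xi: float, eps: float, alpha: float, c: float = 1.0) -> float:
    """(phi_hat_eps(xi) - 1) / beta for phi_hat(xi) = exp(-c |xi|^alpha), beta = eps**alpha.

    phi_hat_eps(xi) = phi_hat(eps * xi) because phi_eps(s) = phi(s / eps) / eps.
    """
    beta = eps ** alpha
    # expm1 avoids cancellation when eps * |xi| is small.
    return math.expm1(-c * (eps * abs(xi)) ** alpha) / beta


if __name__ == "__main__":
    xi, alpha = 2.0, 1.5
    limit = -abs(xi) ** alpha  # fractional symbol -c |xi|^alpha with c = 1
    for eps in (1e-1, 1e-2, 1e-3):
        value = scaled_symbol(xi, eps, alpha)
        print(f"eps = {eps:7.0e}  symbol = {value:+.6f}  limit = {limit:+.6f}")
```

As $\varepsilon$ decreases, the printed symbol should approach $-c|\xi|^\alpha$, the symbol of the fractional operator in (1.6) and (2.9).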

3. The Nonlocal Cattaneo-Vernotte Equation and CTRWs. Another perspective on
the nonlocal Cattaneo-Vernotte equation (1.8) comes from CTRWs. We consider a separable²
continuous time random walk with a wait-time density $\omega$ and a radial step-length density $\phi$.
One form of the generalized master equation, which is an equation for the time evolution of
the joint probability density function for the state of the CTRW, is

\[ u_t(x,t) = \int_0^t \Lambda(t-t') \int_{\mathbb{R}} \bigl(u(y,t') - u(x,t')\bigr)\,\phi(x-y)\,dy\,dt', \tag{3.1} \]

² The assumption of separability simply states that wait-times and step-lengths are independent.
where the Laplace transform of the memory kernel $\Lambda$ is determined by that of $\omega$ via

\[ \hat{\Lambda}(s) = \frac{s\,\hat{\omega}(s)}{1 - \hat{\omega}(s)}. \tag{3.2} \]

We refer the reader to [3, 16] for a derivation and thorough discussion of (3.1). The memory
kernel $\Lambda$ captures potential memory effects due to the wait-times. In fact, only when
$\Lambda(t) \propto \delta(t)$ is the underlying CTRW Markovian, i.e., wait-times are exponentially distributed and are
therefore memoryless.

Literature demonstrating the use of (3.1) in the modeling of diffusion processes is plentiful.
For instance, taking $\Lambda(t) = \frac{1}{\beta}\delta(t)$ gives rise to the nonlocal diffusion equation (1.7).
If $\phi$ is a weighted average of Dirac measures on an integer lattice, then (3.1) describes the
probability density of the CTRW on that lattice; see [13]. Moreover, see [16, 17], subdiffusive
processes have been studied by taking $\hat{\Lambda}(s) = s^{1-\mu}$, $\mu \in (0,1)$. The following lemma provides
conditions for (3.1) to yield (1.8).

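Lemma 3.1. The nonlocal Cattaneo-Vernotte equation (1.8) is obtained from (3.1) by
taking

\[ \Lambda(t-t') = \frac{1}{\beta}\,\frac{2}{\tau} \exp\left(-\frac{t-t'}{\tau/2}\right) \tag{3.3} \]

and imposing the restriction $\beta \ge 2\tau$. The assumption (3.3) is tantamount to

\[ \omega(t) =
\begin{cases}
\dfrac{t}{\tau^2} \exp\left(-\dfrac{t}{\tau}\right), & \beta = 2\tau, \\[10pt]
\dfrac{2}{\sqrt{\beta(\beta - 2\tau)}} \exp\left(-\dfrac{t}{\tau}\right) \sinh\left(\dfrac{\sqrt{\beta(\beta - 2\tau)}}{\beta\tau}\, t\right), & \beta > 2\tau.
\end{cases} \tag{3.4} \]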

Proof. First, (3.4) is implied by (3.2) and (3.3) via Laplace transform techniques and,
in the process, we find $\beta \ge 2\tau$ if and only if $\omega(t) \ge 0$ for all $t > 0$, which is necessary for $\omega$
to be a probability density. Insertion of (3.3) into (3.1) and differentiating with respect to $t$
yields

\[ u_{tt}(x,t) = -\frac{2}{\tau}\, u_t(x,t) + \frac{2}{\beta\tau} \int_{\mathbb{R}} \bigl(u(y,t) - u(x,t)\bigr)\,\phi(x-y)\,dy \]

and thus, upon rearranging, (1.8). □
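The wait-time density implied by the lemma can be sanity-checked numerically. The sketch below assumes the closed form obtained by inverting $\hat\omega(s) = 2/(\beta\tau s^2 + 2\beta s + 2)$, which follows from (3.2) with the exponential kernel (3.3); the function names and the trapezoidal quadrature are illustrative choices, not part of the paper.

```python
import math


def wait_time_density(t: float, beta: float, tau: float) -> float:
    """Inverse Laplace transform of 2 / (beta*tau*s^2 + 2*beta*s + 2) for beta >= 2*tau."""
    if tau <= 0.0 or beta < 2.0 * tau:
        raise ValueError("requires tau > 0 and beta >= 2 * tau")
    if beta == 2.0 * tau:
        # Double root: a Gamma(2, tau) density.
        return t / tau ** 2 * math.exp(-t / tau)
    root = math.sqrt(beta * (beta - 2.0 * tau))
    d = root / (beta * tau)
    # Same as (2/root) * exp(-t/tau) * sinh(d*t), written to avoid overflow in sinh.
    return (math.exp(-(1.0 / tau - d) * t) - math.exp(-(1.0 / tau + d) * t)) / root


def moments(beta: float, tau: float, n: int = 200_000):
    """Trapezoidal estimates of the total mass and the mean of the density on [0, 60*beta]."""
    t_max = 60.0 * beta
    h = t_max / n
    mass = mean = 0.0
    for k in range(n + 1):
        t = k * h
        weight = 0.5 * h if k in (0, n) else h
        f = wait_time_density(t, beta, tau)
        mass += weight * f
        mean += weight * t * f
    return mass, mean


if __name__ == "__main__":
    for beta, tau in ((1.0, 0.25), (0.5, 0.25)):
        mass, mean = moments(beta, tau)
        print(f"beta = {beta}, tau = {tau}: mass = {mass:.6f}, mean = {mean:.6f}")
```

Under these assumptions the mass should be close to 1 and the mean close to $\beta$, consistent with $\beta$ being the mean wait-time.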

The special case $\beta = 2\tau$ implies $W \sim \mathrm{Gamma}(2, \tau)$, where $W$ is the wait-time random
variable. For the duration of the paper, we restrict³ ourselves to $\beta \ge 2\tau$ so that we have
an interpretation from a CTRW perspective. This restriction has appealing consequences as
well, e.g., positivity of solutions and conservation of mass. We note

\[ \hat{\omega}(s) = \frac{2}{\beta\tau s^2 + 2\beta s + 2} = \sum_{k=0}^{\infty} \frac{(-1)^k}{k!}\, E(W^k)\, s^k = 1 - \beta s + o(s), \]

which shows that the mean wait-time is indeed $\beta$. Recalling also (2.8), we compute the
so-called pseudo mean square displacement [17] of a random walker,

\[ \int_{-L_1 t^{1/\alpha}}^{L_1 t^{1/\alpha}} x^2 u(x,t)\,dx \sim t^{2/\alpha}, \]

³ This restriction is necessary for a CTRW interpretation, but might be relaxed in other contexts.
which is the mean square displacement on a bounded interval that grows in size with $t$. When
$\alpha \in (0,2)$, $2/\alpha > 1$ and (1.8) is thus a model for anomalous superdiffusion [17]. One considers
a pseudo mean square displacement in this situation, since the true mean square displacement
is infinite for any $\alpha \in (0,2)$. In the special case when $\alpha = 2$, the diffusion is not anomalous.


4. The Nonlocal Cattaneo-Vernotte Equation on Bounded Domains. The results in
[11] provide a variational formulation for nonlocal boundary value problems for (1.8). This
formulation closely follows that presented for the nonlocal diffusion equation (1.7) in [5]. Before giving
these variational formulations and describing the ensuing finite element method, we establish
some notation.

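We consider the bounded domain $\Omega = (0,1)$. Define the bilinear form

\[ B_I(u,v) := \frac{1}{2} \int_I \int_I \bigl(u(y,t) - u(x,t)\bigr)\bigl(v(y) - v(x)\bigr)\,\phi_\varepsilon(x-y)\,dy\,dx, \tag{4.1} \]

where $I \in \{\mathbb{R}, (0,1)\}$. Let $V_I = L^2(I)$, where

\[ L^2(I) := \left\{ v : \int_I |v|^2\,dx < \infty \right\}, \]

and let $\mathcal{V}_I$ denote possible choices for the subspaces of test and trial functions, with

\[ \mathcal{V}_{\mathbb{R}} := \left\{ v \in V_{\mathbb{R}} : v|_{\mathbb{R}\setminus(0,1)} = 0 \right\}
\quad\text{and}\quad
\mathcal{V}_{(0,1)} := \left\{ v \in V_{(0,1)} : \int_0^1 v\,dx = \int_0^1 u_0\,dx \right\}, \]

where $u(x,0) = u_0(x)$ is a given initial density.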

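The nonlocal homogeneous Dirichlet ($I = \mathbb{R}$) and Neumann ($I = (0,1)$) boundary value
problems for (1.8) are presented together: Find $u \in \mathcal{V}_I \times (0,\infty)$ such that

\[
\begin{aligned}
u_t(x,t) + \frac{\tau}{2}\, u_{tt}(x,t) &= \frac{1}{\beta} \int_I \bigl(u(y,t) - u(x,t)\bigr)\,\phi_\varepsilon(x-y)\,dy, & x &\in (0,1), \\
u(x,0) &= u_0(x), & x &\in (0,1), \\
u_t(x,0) &= 0, & x &\in (0,1).
\end{aligned} \tag{4.2}
\]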


We present a useful result from [9].

Theorem 4.1 (Emmrich and Weckner (2006)). Suppose

\[ k_0 := \operatorname*{ess\,sup}_{x \in I} |K_0(x)| < \infty \quad\text{and}\quad k := \int_I \int_I |K(x,y)|^2\,dy\,dx < \infty. \]

For a given $u_0 \in V_I$, there is a unique mild solution $u \in C^2([0,T]; V_I)$ to

\[ \frac{2}{\tau}\, u_t(x,t) + u_{tt}(x,t) = \int_I K(x,y)\,u(y,t)\,dy - K_0(x)\,u(x,t). \]

We remark that existence and uniqueness of solutions to (4.2) follows from Theorem 4.1
with

\[ K(x,y) := \frac{2}{\tau\beta}\,\phi_\varepsilon(x-y) \quad\text{and}\quad K_0(x) := \frac{2}{\tau\beta} \int_I \phi_\varepsilon(x-y)\,dy. \]


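The variational formulations of (4.2) are: Find $u \in \mathcal{V}_I \times (0,\infty)$ such that

\[
\begin{aligned}
\int_I u_t v\,dx + \frac{\tau}{2} \int_I u_{tt} v\,dx + \frac{1}{\beta}\, B_I(u,v) &= 0, & \forall v &\in \mathcal{V}_I, \\
u(x,0) &= u_0(x), & x &\in (0,1), \\
u_t(x,0) &= 0, & x &\in (0,1).
\end{aligned} \tag{4.3}
\]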
We refer the reader to [5, 11] for more details concerning the variational formulations.

The nonlocal Dirichlet boundary condition constrains the field $u$ on $\mathbb{R} \setminus (0,1)$, which
is analogous to the classical Dirichlet boundary condition that does so on $\{0,1\}$. For the
nonlocal Neumann boundary condition, the integral in (4.2) is only over $(0,1)$ rather than
all of $\mathbb{R}$. This constrains diffusion to occur only inside $(0,1)$, i.e., density neither enters nor
exits $(0,1)$, which is analogous to the classical Neumann boundary condition. Further, since
$B_{(0,1)}(u,1) = 0$, the compatibility condition necessary for the Neumann problem to possess a
solution is

\[ \int_0^1 u(x,t)\,dx = \int_0^1 u_0(x)\,dx =: \bar{u}_0, \qquad \forall t > 0, \tag{4.4} \]

which is a statement that the integrated quantity $u$ is conserved for all time.

Theorem 4.2. Let $u \in C^2([0,T]; V_I)$ be the unique solution to (4.2). Then, $u_t(x,t) \to 0$
as $t \to \infty$, for almost every $x \in (0,1)$.

Proof. Multiply (4.2) by $u_t(x,t)$, integrate over $x \in (0,1)$, and then integrate in $t$ to obtain

\[ \frac{\tau}{4} \int_I u_t^2(x,t)\,dx = \frac{1}{2\beta}\bigl(B_I(u_0,u_0) - B_I(u,u)\bigr) - \int_0^t \int_I u_t^2(x,s)\,dx\,ds \]

and thus

\[ B_I(u_0,u_0) \ge B_I(u,u) + 2\beta \int_0^t \int_I u_t^2(x,s)\,dx\,ds \ge 2\beta \int_0^t \int_I u_t^2(x,s)\,dx\,ds. \]

Since $B_I(u_0,u_0) < \infty$, $u_t(x,t) \in L^2(I)$ for all $t$ and

\[ \|u_t(\cdot,t)\|_{L^2(I)}^2 = \int_I u_t^2(x,t)\,dx \to 0. \]

The completeness of $L^2(I)$ implies that $u_t \to g$ with $\|g\|_{L^2(I)} = 0$, i.e., $g = 0$ almost everywhere
and, thus, $u_t \to 0$ for almost every $x \in (0,1)$. □

A stationary solution to (4.2), $u_s \in V_I$, solves

\[ \int_I \bigl(u_s(y) - u_s(x)\bigr)\,\phi(x-y)\,dy = 0, \qquad \forall x \in (0,1). \]

The results in [8, 11] demonstrate that the unique stationary solution of the homogeneous
Dirichlet problem is $u_s = 0$ and that of the homogeneous Neumann problem is $u_s = \bar{u}_0$.
Consequently, a simple corollary to Theorem 4.2 is $u(x,t) \to u_s(x)$ as $t \to \infty$ for almost
every $x \in (0,1)$.

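4.1. A Semi-discrete Finite Element Formulation. To formulate the finite element
method, we partition $(0,1)$ into $n$ subintervals $\Omega_i$ and let $\chi_i(x)$ be the indicator function for
$\Omega_i$. We denote the space of piecewise constant functions on the subintervals $\Omega_i$ by $\mathcal{V}_I^h$. Note
that any $u_h \in \mathcal{V}_I^h \times (0,\infty)$ can be written

\[ u_h(x,t) = \sum_{j=1}^{n} \gamma_j(t)\,\chi_j(x). \]

The discrete variational problem is then: Find $u_h \in \mathcal{V}_I^h \times (0,\infty)$ such that

\[ M\dot{\gamma} + \frac{\tau}{2}\, M\ddot{\gamma} = -A\gamma, \]

where $M$ and $A$ are the mass and stiffness matrices defined by

\[ M_{ii} = |\Omega_i| \quad\text{and}\quad
A_{ij} = \frac{1}{\beta}
\begin{cases}
-\displaystyle\int_{\Omega_i} \int_{\Omega_j} \phi_\varepsilon(x-y)\,dy\,dx, & i \ne j, \\[10pt]
\displaystyle\int_{\Omega_i} \int_{I \setminus \Omega_i} \phi_\varepsilon(x-y)\,dy\,dx, & i = j.
\end{cases} \]

For the Neumann problem, in light of (4.4), $u_h \in \mathcal{V}_{(0,1)}^h \times (0,\infty)$ is extracted by enforcing that

\[ \sum_{j=1}^{n} \gamma_j(t)\,|\Omega_j| = \bar{u}_0. \]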
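The assembly for the Neumann case can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: a uniform mesh on $(0,1)$, piecewise constant basis functions, and a midpoint-rule approximation of the double integrals (the paper does not specify its quadrature). The function name `assemble_neumann` and its arguments are hypothetical.

```python
def assemble_neumann(n: int, beta: float, phi_eps):
    """Diagonal mass matrix and stiffness matrix for I = (0, 1) on n uniform cells.

    Off-diagonal entries use the midpoint rule,
        A_ij ~ -(1/beta) * h * h * phi_eps(x_i - x_j),   i != j,
    and the diagonal is the negated sum of the other entries in its row, which
    matches A_ii = (1/beta) * int_{Omega_i} int_{I minus Omega_i} phi_eps dy dx.
    """
    h = 1.0 / n
    midpoints = [(i + 0.5) * h for i in range(n)]
    mass = [h] * n  # M_ii = |Omega_i|
    stiffness = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                stiffness[i][j] = -h * h * phi_eps(midpoints[i] - midpoints[j]) / beta
        stiffness[i][i] = -sum(stiffness[i][j] for j in range(n) if j != i)
    return mass, stiffness


if __name__ == "__main__":
    eps = 1.0
    beta = eps ** 2 / 6.0  # scaling used for the uniform kernel

    def uniform_kernel(s: float) -> float:
        return 1.0 / (2.0 * eps) if abs(s) < eps else 0.0

    mass, stiffness = assemble_neumann(4, beta, uniform_kernel)
    for row in stiffness:
        print(["%+.4f" % entry for entry in row])
    print("row sums:", [sum(row) for row in stiffness])
```

Because each row of the stiffness matrix sums to zero and the matrix is symmetric, the discrete total $\sum_j \gamma_j(t)|\Omega_j|$ is preserved by the semi-discrete system when the initial velocity is zero, mirroring (4.4).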


5. Numerical Experiments and Examples. In this section, we present two examples
to demonstrate various properties of numerical solutions of the nonlocal Cattaneo-Vernotte
equation on bounded domains. In each example, $\phi_\varepsilon$ is defined in (2.5) and we use the scaling

\[ \beta = 2\tau = c\,\varepsilon^{\alpha}, \tag{5.1} \]

where $\alpha$ and $c$ are given in (2.8), so that we have both the probabilistic interpretation in
Section 3 and a relationship to fractional diffusion established in Section 2.

The first example examines a nonlocal Cattaneo-Vernotte equation with homogeneous
Neumann boundary conditions that admits an analytic solution for any initial condition. We
demonstrate that solutions can be viewed as perturbations of solutions to the corresponding
nonlocal diffusion equation (1.7) and we investigate the effects of a nonzero relaxation time.
In Example 2, we consider the discontinuous initial condition

\[ u_0(x) =
\begin{cases}
0, & 0 \le x < 0.5, \\
1, & 0.5 \le x \le 1,
\end{cases} \tag{5.2} \]

and investigate the effects of vanishing relaxation time and nonlocality, i.e., letting $\varepsilon \to 0$, on
the solutions to a nonlocal Dirichlet boundary value problem. We use Lévy stable densities
of various stability indices to illustrate the relationship to classical and fractional diffusion in
this limit.

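Example 1. Consider the nonlocal homogeneous Neumann Cattaneo-Vernotte equation

\[
\begin{aligned}
u_t + \frac{\varepsilon^2}{24}\, u_{tt} &= \frac{6}{\varepsilon^2} \int_0^1 \bigl(u(y,t) - u(x,t)\bigr)\,\phi_\varepsilon(x-y)\,dy, & x &\in (0,1), \\
u(x,0) &= u_0(x), & x &\in (0,1), \\
u_t(x,0) &= 0, & x &\in (0,1),
\end{aligned} \tag{5.3}
\]

where

\[ \phi_\varepsilon(s) = \frac{1}{2\varepsilon}\,\chi_{(-\varepsilon,\varepsilon)}(s), \qquad \varepsilon \ge 1, \]

so that $\alpha = 2$, $c = 1/6$, and, consequently, $\beta = \varepsilon^2/6$ and $\tau = \varepsilon^2/12$. The goal of this example
is to consider the case of increasing $\varepsilon$, i.e., increasing nonlocality.

In this example, since $\varepsilon \ge 1$ and $\mathrm{supp}(\phi_\varepsilon(x-y))$ contains $(0,1)$ for all $x \in (0,1)$, (5.3)
reduces to an ordinary differential equation whose solution can be given as a convex combination
of the initial condition $u_0(x)$ and the constant $\bar{u}_0 = \int_0^1 u_0(x)\,dx$,

\[ u_c(x,t) = \bar{u}_0\bigl(1 - \xi_c(t)\bigr) + \xi_c(t)\,u_0(x), \tag{5.4} \]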
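where

\[ \xi_c(t) = \exp\left(-\frac{12t}{\varepsilon^2}\right)
\left[ \frac{1}{\sqrt{1 - \frac{1}{2\varepsilon}}} \sinh\left(\frac{12t}{\varepsilon^2}\sqrt{1 - \frac{1}{2\varepsilon}}\right)
+ \cosh\left(\frac{12t}{\varepsilon^2}\sqrt{1 - \frac{1}{2\varepsilon}}\right) \right]. \]

The function $\xi_c(t) \in (0,1]$ is a strictly decreasing function that tends to zero as $t \to \infty$. If
$u_0(x) = \bar{u}_0$ for some $x \in (0,1)$, then $x$ is a fixed point, i.e., $u(x,t) = u_0(x)$ for all $t > 0$.
Also, the monotonicity of $\xi_c$ implies $u(x,t) \nearrow \bar{u}_0$ if $u_0(x) < \bar{u}_0$ and, likewise, $u(x,t) \searrow \bar{u}_0$
if $u_0(x) > \bar{u}_0$ as $t \to \infty$. As $\varepsilon \to \infty$, $\xi_c(t) \to 1$ for any fixed $t < \infty$. Thus, $u_c(x,t)$ can be
well-approximated by $u_0(x)$ for arbitrarily large finite time by choosing $\varepsilon$ sufficiently large.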

In [5], it was shown that the solution of (1.7) with homogeneous Neumann boundary
conditions,

\[
\begin{aligned}
u_t &= \frac{6}{\varepsilon^2} \int_0^1 \bigl(u(y,t) - u(x,t)\bigr)\,\phi_\varepsilon(x-y)\,dy, & x &\in (0,1), \\
u(x,0) &= u_0(x), & x &\in (0,1),
\end{aligned} \tag{5.5}
\]

for the same $\phi_\varepsilon$ as in (5.3), is also given by a convex combination of $u_0(x)$ and $\bar{u}_0$,

\[ u_d(x,t) = \bar{u}_0\bigl(1 - \xi_d(t)\bigr) + \xi_d(t)\,u_0(x), \tag{5.6} \]

where

\[ \xi_d(t) = \exp\left(-\frac{3t}{\varepsilon^3}\right). \]

Thus, solutions of (5.3) can be given by

\[ u_c(x,t) = u_d(x,t) + \bigl(\xi_c(t) - \xi_d(t)\bigr)\bigl(u_0(x) - \bar{u}_0\bigr), \]

the sum of the solution to (5.5) and a perturbation $(\xi_c(t) - \xi_d(t))(u_0(x) - \bar{u}_0)$ due to a nonzero
relaxation time. Since $u_0(x)$ and $\bar{u}_0$ are fixed for a given initial condition, we study the difference
$u_c(x,t) - u_d(x,t)$ simply by investigating $\xi_c(t) - \xi_d(t)$.

In Fig. 5.1, we plot $\xi_c(t) - \xi_d(t)$ for $t \in [0,3]$ and $\varepsilon \in [1,3]$. As $t \to \infty$, $\xi_c(t) - \xi_d(t) \to 0$,
but more slowly for increasing $\varepsilon$. This reflects agreement of stationary solutions for the two
problems. For small values of $t$, $\xi_c(t) > \xi_d(t)$, which is an effect of the nonzero relaxation
time. After this short time frame, $\xi_c(t) - \xi_d(t) = 0$, i.e., the solutions agree exactly at some
point in time $t > 0$, and then $\xi_c(t) < \xi_d(t)$ for the duration of time. These observations hold
for all $\varepsilon$, but are less dramatic as $\varepsilon$ increases.
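The behavior described for Fig. 5.1 can be reproduced with a short script. The sketch assumes that, for $\varepsilon \ge 1$, (5.3) reduces to the ODE $\frac{\varepsilon^2}{24}\xi'' + \xi' + \frac{3}{\varepsilon^3}\xi = 0$ with $\xi(0) = 1$, $\xi'(0) = 0$, and that (5.5) reduces to $\xi' = -\frac{3}{\varepsilon^3}\xi$; the closed forms in the code are derived from those ODEs. The function names are hypothetical.

```python
import math


def xi_d(t: float, eps: float) -> float:
    """Decay factor of the nonlocal diffusion problem: exp(-3 t / eps^3)."""
    return math.exp(-3.0 * t / eps ** 3)


def xi_c(t: float, eps: float) -> float:
    """Decay factor of the nonlocal Cattaneo-Vernotte problem for eps >= 1."""
    if eps < 1.0:
        raise ValueError("the closed form assumes eps >= 1")
    rate = 12.0 / eps ** 2
    kappa = math.sqrt(1.0 - 1.0 / (2.0 * eps))
    return math.exp(-rate * t) * (
        math.cosh(kappa * rate * t) + math.sinh(kappa * rate * t) / kappa
    )


def ode_residual(t: float, eps: float, h: float = 1e-4) -> float:
    """Central-difference residual of (eps^2/24) xi'' + xi' + (3/eps^3) xi for xi_c."""
    first = (xi_c(t + h, eps) - xi_c(t - h, eps)) / (2.0 * h)
    second = (xi_c(t + h, eps) - 2.0 * xi_c(t, eps) + xi_c(t - h, eps)) / h ** 2
    return eps ** 2 / 24.0 * second + first + 3.0 / eps ** 3 * xi_c(t, eps)


if __name__ == "__main__":
    eps = 1.0
    for t in (0.0, 0.05, 0.25, 1.0, 3.0):
        print(f"t = {t:4.2f}  xi_c - xi_d = {xi_c(t, eps) - xi_d(t, eps):+.5f}")
    print("ODE residual at t = 0.5:", ode_residual(0.5, eps))
```

For $\varepsilon = 1$ the difference is positive at early times and negative later, which agrees with the sign change described in the text.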


Fig. 5.1. [Three panels: (a) $\varepsilon = 1$ and $t \in [0,3]$; (b) $\varepsilon \in [1,3]$ and $t \in [0,3]$; (c) $\varepsilon = 5/4$ and $t \in [0,3]$.]
The vertical axis is $\xi_c(t) - \xi_d(t)$ in all three panels. In panels (a) and (c) the horizontal axis is $t \in [0,3]$.
Fig. 5.2. [Six panels: (a) $\alpha = 2$, $\varepsilon = 1/4$; (b) $\alpha = 2$, $\varepsilon = 1/8$; (c) $\alpha = 2$, $\varepsilon = 1/16$;
(d) $\alpha = 3/2$, $\varepsilon = 1/4$; (e) $\alpha = 3/2$, $\varepsilon = 1/8$; (f) $\alpha = 3/2$, $\varepsilon = 1/16$.]
Each panel shows solutions to the nonlocal homogeneous Dirichlet problem for different $\alpha$ and $\varepsilon$. The
density $\phi_\varepsilon^{\alpha}$ is used, where $\phi^{\alpha}$ is a Lévy stable density with index of stability $\alpha$. Since $c = 1$, we take $2\tau = \varepsilon^{\alpha}$. The
vertical axis in each panel is the value of $u_h(x,t)$ and the horizontal axis is $x$. The ten different solution profiles in
each panel correspond to the solutions at ten different times, $t \in [0, 0.25]$.


Example 2. The fractional diffusion behavior of boundary value problems for (1.8) is examined
by choosing $\phi = \phi^\alpha$ to be a symmetric and centered Lévy stable density with stability
index $\alpha \in \{2, 3/2, 1, 1/2\}$. As explained in Section 2, $\alpha$ represents the fraction of the Laplacian
in the equations (1.6) and (2.9). Such Lévy stable densities, normalized so that $c = 1$,
are characterized, via the Lévy-Khintchine representation, through their Fourier transforms,
i.e.,


$$\phi^\alpha(s) = \mathcal{F}^{-1}\left(\exp(-|\cdot|^\alpha)\right)(s), \tag{5.7}$$


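see [1, §§ 1.2.5]. We use (2.5) to define $\phi_\varepsilon^\alpha$ and the cases $\alpha = 2, 1$ yield closed-form expressions
for $\phi_\varepsilon^\alpha$:

$$\phi_\varepsilon^\alpha(s) = \begin{cases} \dfrac{1}{2\varepsilon\sqrt{\pi}} \exp\left(-\dfrac{s^2}{4\varepsilon^2}\right), & \alpha = 2, \\[2ex] \dfrac{\varepsilon}{\pi(s^2 + \varepsilon^2)}, & \alpha = 1, \end{cases}$$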

which are Gaussian and Cauchy densities, respectively. Although other values of $\alpha$ do not
admit closed forms for $\phi_\varepsilon^\alpha$, they can be estimated by approximating (5.7). Regardless of
$\alpha$, $\phi^\alpha$ is symmetric and unimodal. For $\alpha < 2$, the second moment is infinite and for $\alpha < 1$, all
moments are infinite.
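
As a rough illustration of how (5.7) can be approximated, the sketch below evaluates the inverse Fourier transform by simple quadrature. It is not the procedure used to produce the figures. It assumes the convention $\phi^\alpha(s) = \frac{1}{2\pi}\int \exp(-|\omega|^\alpha) e^{i\omega s}\, d\omega$ with $c = 1$ and no $\varepsilon$ scaling; the function name `stable_density`, the cutoff `omega_max` and the grid size are hypothetical choices.

```python
import numpy as np

def stable_density(s, alpha, omega_max=50.0, num=50001):
    """Approximate phi^alpha(s) = (1/pi) * int_0^inf exp(-w**alpha) cos(w s) dw.

    The integral is truncated at omega_max (an assumed cutoff) and evaluated
    with the composite trapezoid rule on a uniform grid.
    """
    s = np.atleast_1d(np.asarray(s, dtype=float))
    w = np.linspace(0.0, omega_max, num)
    integrand = np.exp(-w**alpha) * np.cos(np.outer(s, w))
    dw = w[1] - w[0]
    integral = dw * (integrand.sum(axis=1)
                     - 0.5 * (integrand[:, 0] + integrand[:, -1]))
    return integral / np.pi

s = np.linspace(-3.0, 3.0, 13)

# Closed forms for the two special cases (unscaled, i.e. epsilon = 1).
cauchy = 1.0 / (np.pi * (1.0 + s**2))                       # alpha = 1
gaussian = np.exp(-s**2 / 4.0) / (2.0 * np.sqrt(np.pi))     # alpha = 2

print(np.max(np.abs(stable_density(s, 1.0) - cauchy)))
print(np.max(np.abs(stable_density(s, 2.0) - gaussian)))
print(stable_density(s, 1.5))   # no closed form; symmetric and peaked at s = 0
```

For $\alpha = 1/2$ the integrand decays slowly, so a much larger cutoff or a dedicated quadrature would be needed than the one assumed here.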

Fig. 5.2 plots the time evolution of the approximate solutions to the nonlocal homogeneous
Dirichlet boundary value problem described in (4.2) given by the finite element method
with mesh spacing $h = 5 \cdot 10^{-4}$ and $t \in [0, 0.25]$. We consider $\alpha \in \{2, 3/2, 1, 1/2\}$ and various
$\varepsilon$. The solutions with $\phi_\varepsilon^2$ behave asymptotically, with respect to $\varepsilon$, as solutions to the classical
diffusion equation (1.4). However, the asymptotic behavior of solutions with $\phi_\varepsilon^\alpha$, $\alpha < 2$, is
given by a fractional Laplace parabolic equation (1.6). Consequently, the magnitude of the
jump discontinuity in the initial data decays more slowly in these latter cases.

A DISCONTINUOUS VELOCITY LEAST SQUARES FINITE ELEMENT METHOD
FOR THE STOKES EQUATIONS WITH IMPROVED MASS CONSERVATION

JAMES LAI§, PAVEL BOCHEV¶, LUKE OLSON, KARA PETERSON, DENIS RIDZAL, AND CHRIS SIEFERT

Abstract.  Conventional  least  squares  finite  element  methods  (LSFEM)  for  incompressible  flows  conserve  mass 
approximately.  In  some  cases,  this  can  lead  to  an  unacceptable  loss  of  mass  and  unphysical  solutions.  In  this  report 
we  formulate  a  new,  locally  conservative  LSFEM  for  the  Stokes  equations  which  computes  a  discrete  velocity  field 
that  is  point-wise  divergence  free  on  each  element.  To  this  end,  we  employ  discontinuous  velocity  approximations 
which are defined by using a local stream function on each element. The effectiveness of the new LSFEM
in improving local and global mass conservation is compared with that of a conventional LSFEM employing standard $C^0$
Lagrangian elements.

1. Introduction. Least-squares finite element methods (LSFEMs) have been applied
to incompressible flows with varying success. The key issue is that LSFEMs are residual
minimization schemes and hence conserve mass only approximately. For some problem configurations,
this can lead to an unacceptable loss of mass and unphysical solutions. A locally
conservative mimetic LSFEM has been defined for the Stokes equations in [4] and [3, Section 7.7]
using compatible finite element spaces. However, the mimetic LSFEM requires
non-standard boundary conditions specifying the normal velocity and the tangential vorticity
on the domain boundary. So far, it has not been extended to the more common and practically
important velocity boundary condition and it is not clear whether or not this can be done.

Mass conservation in least-squares methods for the Stokes equations with the velocity
boundary condition has been studied extensively in the literature [6, 9, 10, 13, 14, 15]. Loss
of mass in LSFEMs can be countered by mesh refinement [13], high order elements [16],
modifying the least-squares functional [15], weighting the continuity equation more strongly
[10], or by enforcing it on each element by Lagrange multipliers [9]. However, none
of these approaches can be deemed completely satisfactory.

Mass conservation does not improve proportionally with mesh refinement, which makes
refinement an impractical remedy. High order elements require an increased amount of storage and
computation and the improvements to mass conservation are not commensurate with the additional
cost [13, 15]. Modifying the least-squares functional with terms that promote mass
conservation has proven to be very successful [15]; however, it is an ad hoc way of improving
mass conservation and may depend on the problem at hand. Another alternative is to enforce
element-wise mass conservation using Lagrange multipliers [9]. While this approach yields
exact mass conservation on each element, it also results in a saddle-point system, thereby
negating the main reason one may want to consider LSFEMs.

An idea that has not been explored much in the context of LSFEMs is the use of discontinuous
elements. Discontinuous LSFEMs can be viewed as generalizations of LSFEMs
for transmission and mesh-tying problems (see [1], [3, Section 12.10] and [8]) from a fixed
number of subdomains to an arbitrary number of subdomains.

In this report we formulate, in two stages, a new locally conservative LSFEM for the
Stokes equations with the velocity boundary condition by using discontinuous velocity approximations.
Our starting point is a weighted $L^2$ least-squares formulation [2] employing
conventional $C^0$ elements and the velocity-vorticity-pressure (VVP) form of the Stokes equations.
The first stage relaxes the continuity of the velocity field only and adds new terms
which penalize the normal and the tangential jumps of the velocity across the element interfaces.
We show that by adjusting the relative importance of the normal and tangential jump
terms this intermediate discontinuous velocity LSFEM can lead to noticeable improvements
in the mass conservation. However, the weights required for improved mass conservation
differ from problem to problem, thereby making this formulation insufficiently robust for
practical problems.

§University of Illinois at Urbana-Champaign, Department of Computer Science, jhlai2@illinois.edu
¶Sandia National Laboratories, pbboche@sandia.gov

At the second stage, we proceed to define the discontinuous velocity field on each element
as the curl of a local stream function. This guarantees that the velocity is pointwise
divergence free on each element. Thus, our approach can be interpreted as an implementation
of the intermediate discontinuous velocity LSFEM using a locally divergence-free basis for the
velocity. This idea bears some similarity with the discrete LSFEM in [7], with two crucial distinctions.
First and foremost, the method in [7] is not a discontinuous formulation; in order to
cope with the discontinuity in the approximating space this method replaces the differential
operators by weak discrete versions defined using integration by parts. The second distinction
is that we eliminate the velocity completely and work directly with the stream function,
whereas [7] retains the original fields.

The resulting discontinuous stream-function-vorticity-pressure (SVP) LSFEM is locally
conservative and offers much improved global and local mass conservation compared to its
parent LSFEM employing $C^0$ elements. We demonstrate the usefulness of the new formulation
through a series of numerical examples.

1.1. Notation. For simplicity we restrict attention to two space dimensions and bounded,
simply connected regions $\Omega \subset \mathbb{R}^2$ with Lipschitz-continuous boundary. In what follows we
use the standard notation $H^k(\Omega)$ for the Sobolev space of all square integrable functions which
have square integrable derivatives of orders up to $k$. The norm and inner product on $H^k(\Omega)$ are
$\|\cdot\|_k$ and $(\cdot,\cdot)_k$, respectively.

As usual, when $k = 0$ we write $L^2(\Omega)$, $(\cdot,\cdot)$ and $\|\cdot\|_0$. The symbol $H_0^1(\Omega)$ denotes the
subspace of $H^1(\Omega)$ of functions whose trace vanishes on $\partial\Omega$ and $L_0^2(\Omega)$ is the subspace of $L^2$
fields with vanishing mean. The dual of $H_0^1(\Omega)$ is the space $H^{-1}(\Omega)$ with norm

$$\|u\|_{-1} = \sup_{v \in H_0^1(\Omega)} \frac{(u, v)}{\|v\|_1}. \tag{1.1}$$


Vector valued fields and their associated function spaces are denoted by bold face symbols,
e.g., $\mathbf{u} = (u_1, u_2)$ is a vector field in two dimensions and $\mathbf{H}^k(\Omega)$ is the Sobolev space of
vector fields whose components are in $H^k(\Omega)$. In two dimensions, the curl is defined for scalar
and vector functions as

$$\nabla \times \omega = \begin{pmatrix} \omega_y \\ -\omega_x \end{pmatrix}, \qquad \nabla \times \mathbf{u} = u_{2,x} - u_{1,y}. \tag{1.2}$$


We use $\mathcal{K}$ to denote a partition of $\Omega$ into finite elements $K$. In two dimensions $K$ can be a
quadrilateral or a triangle and the interface between two elements is an edge $e$. The sets of all
interior and boundary edges in the mesh are denoted by $\mathcal{E}(\Omega)$ and $\mathcal{E}(\Gamma)$, respectively. Finally,
$\mathcal{E} = \mathcal{E}(\Omega) \cup \mathcal{E}(\Gamma)$ is the set of all edges in the mesh.

The standard $C^0$ finite element spaces of degree $r > 0$ on quadrilateral and triangular
grids are denoted by $Q_r$ and $P_r$, respectively. We will also need their discontinuous versions
$[Q_r]$ and $[P_r]$. When the type of the element is not important we write $R_r$ and $[R_r]$ with the
understanding that $R_r = Q_r$ on quadrilaterals and $R_r = P_r$ on triangles.
Discontinuous finite element methods require various jump terms on element interfaces.
Let $K^+$ and $K^-$ be two adjacent elements that share edge $e$ and denote the velocities on each
element by $\mathbf{u}^+$ and $\mathbf{u}^-$, respectively. Define the jump in normal and tangential components
across $e$ as

$$[\mathbf{u} \cdot \mathbf{n}] = \mathbf{u}^+ \cdot \mathbf{n}^+ + \mathbf{u}^- \cdot \mathbf{n}^-, \qquad [\mathbf{u} \times \mathbf{n}] = \mathbf{u}^+ \times \mathbf{n}^+ + \mathbf{u}^- \times \mathbf{n}^-, \tag{1.3}$$

where $\mathbf{n}^+$ and $\mathbf{n}^-$ are the outer normals on $\partial K^+$ and $\partial K^-$, respectively. The jump of a scalar
function is defined as usual by the difference

$$[\psi] = \psi^+ - \psi^-. \tag{1.4}$$
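
To make the jump definitions in (1.3) concrete, the following sketch evaluates them at a single point of a shared edge. It is only an illustration under stated assumptions: the helper names `cross2d` and `velocity_jumps` are hypothetical, the two-dimensional cross product is taken as the scalar $a_1 b_2 - a_2 b_1$, and the outer normal of $K^-$ is the negative of that of $K^+$.

```python
import numpy as np

def cross2d(a, b):
    """Scalar cross product of two 2D vectors: a1*b2 - a2*b1."""
    return a[0] * b[1] - a[1] * b[0]

def velocity_jumps(u_plus, u_minus, n_plus):
    """Normal and tangential jumps of (1.3) at a point on a shared edge.

    n_plus is the unit outer normal of K+ on the edge; the outer normal of
    K- on the same edge is taken to be -n_plus.
    """
    u_plus, u_minus, n_plus = (
        np.asarray(v, dtype=float) for v in (u_plus, u_minus, n_plus)
    )
    n_minus = -n_plus
    normal_jump = u_plus @ n_plus + u_minus @ n_minus
    tangential_jump = cross2d(u_plus, n_plus) + cross2d(u_minus, n_minus)
    return normal_jump, tangential_jump

# Vertical edge with K+ on the left, so n+ = (1, 0).
n_plus = (1.0, 0.0)
continuous = velocity_jumps((2.0, 3.0), (2.0, 3.0), n_plus)       # both jumps vanish
normal_only = velocity_jumps((2.0, 3.0), (1.5, 3.0), n_plus)      # mismatch in u1 only
tangential_only = velocity_jumps((2.0, 3.0), (2.0, 1.0), n_plus)  # mismatch in u2 only
print(continuous, normal_only, tangential_only)
```

A velocity that is continuous across the edge gives zero for both jumps, while a mismatch in only the normal or only the tangential component shows up in exactly one of them.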


2. The continuous prototype least-squares method. In this section we review the
weighted $L^2$ least-squares method for the Stokes equations which is the prototype for the
discontinuous, locally conservative LSFEM. In terms of the primitive variables the governing
equations assume the form

$$\begin{aligned} -\Delta \mathbf{u} + \nabla p &= \mathbf{f} && \text{on } \Omega \\ \nabla \cdot \mathbf{u} &= 0 && \text{on } \Omega \end{aligned} \tag{2.1}$$


where $\mathbf{u}$ and $p$ are the velocity and the pressure, respectively, and $\mathbf{f}$ is a given vector function
specifying the body force. The system (2.1) is augmented with the velocity boundary
condition

$$\mathbf{u} = 0 \quad \text{on } \partial\Omega \tag{2.2}$$

and the zero mean pressure constraint

$$\int_\Omega p \, d\Omega = 0. \tag{2.3}$$

The first equation in (2.1) governs conservation of momentum while the second (continuity
equation) governs conservation of mass.

Least-squares methods for (2.1), (2.2) and (2.3) are usually defined using an equivalent
first-order form of the Stokes equations. This eliminates the need for globally $H^2$-conforming
finite elements which require $C^1$ continuity and are difficult to construct. There are several
first order formulations of the Stokes equations to choose from, the most common being the
velocity-vorticity-pressure formulation in which the vorticity

$$\omega = \nabla \times \mathbf{u} \tag{2.4}$$

is introduced as a new variable. Using the identity

$$\nabla \times \nabla \times \mathbf{u} = -\Delta \mathbf{u} + \nabla(\nabla \cdot \mathbf{u}) \tag{2.5}$$

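and the continuity equation, we arrive at the velocity-vorticity-pressure (VVP) first order
formulation of the Stokes equations

$$\begin{aligned} \nabla \times \omega + \nabla p &= \mathbf{f} && \text{on } \Omega \\ \omega - \nabla \times \mathbf{u} &= 0 && \text{on } \Omega \\ \nabla \cdot \mathbf{u} &= 0 && \text{on } \Omega \end{aligned} \tag{2.6}$$

The VVP system is augmented with the velocity boundary condition (2.2) and the zero mean
constraint (2.3).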
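
The step from (2.1) to (2.6) rests on the identity (2.5) with the two-dimensional curls of (1.2). The sketch below checks this symbolically with SymPy. It is only a sanity check, not part of the method; the helper names and the sample fields (a divergence-free velocity and a pressure $p = xy$) are chosen purely for illustration.

```python
import sympy as sp

x, y = sp.symbols("x y", real=True)

def curl_scalar(w):
    """Curl of a scalar in 2D, as in (1.2): (w_y, -w_x)."""
    return sp.Matrix([sp.diff(w, y), -sp.diff(w, x)])

def curl_vector(v):
    """Curl of a 2D vector field, as in (1.2): v2_x - v1_y."""
    return sp.diff(v[1], x) - sp.diff(v[0], y)

def div(v):
    return sp.diff(v[0], x) + sp.diff(v[1], y)

def grad(s):
    return sp.Matrix([sp.diff(s, x), sp.diff(s, y)])

def laplacian(v):
    return sp.Matrix([sp.diff(c, x, 2) + sp.diff(c, y, 2) for c in v])

# Identity (2.5) for a generic smooth field u = (u1, u2).
u = sp.Matrix([sp.Function("u1")(x, y), sp.Function("u2")(x, y)])
identity_residual = (
    curl_scalar(curl_vector(u)) - (-laplacian(u) + grad(div(u)))
).applyfunc(sp.simplify)
print(identity_residual)

# Illustrative divergence-free velocity and pressure; f is defined through (2.1).
u_df = curl_scalar(sp.sin(x) * sp.sin(y))
p = x * y
f = -laplacian(u_df) + grad(p)
omega = curl_vector(u_df)
vvp_residual = (curl_scalar(omega) + grad(p) - f).applyfunc(sp.simplify)
print(sp.simplify(div(u_df)), vvp_residual)
```

Both residuals vanish identically, which is consistent with replacing $-\Delta \mathbf{u}$ by $\nabla \times \omega$ once $\nabla \cdot \mathbf{u} = 0$ holds.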
2.1. Weighted $L^2$ least-squares method. LSFEMs define unconstrained minimization
problems via residual minimization over an appropriate Hilbert space. The LSFEM
solution minimizes a norm-equivalent functional formed from the squared residuals of each
equation of the partial differential equation, measured in appropriate norms. The discretized
system obtained by minimizing the functional over the
finite element subspace is guaranteed to be symmetric and positive definite.
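
The symmetry and positive definiteness can be seen in a finite-dimensional analogue. The sketch below is not a finite element assembly: it assumes that the discretized residuals can be stacked into a single matrix `A` with full column rank (here a random stand-in) and shows that minimizing $\|Ax - b\|^2$ leads to the symmetric positive definite normal equations $A^T A x = A^T b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stacked residual operator and data (12 residuals, 4 unknowns).
A = rng.standard_normal((12, 4))
b = rng.standard_normal(12)

# Minimizing ||A x - b||^2 gives the normal equations A^T A x = A^T b.
normal_matrix = A.T @ A
x = np.linalg.solve(normal_matrix, A.T @ b)

eigenvalues = np.linalg.eigvalsh(normal_matrix)
print("symmetric:", np.allclose(normal_matrix, normal_matrix.T))
print("smallest eigenvalue:", eigenvalues.min())
```

Positive definiteness here relies on `A` having full column rank, which plays the role of norm equivalence of the functional in the continuous setting.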

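One can show that for the VVP system with velocity boundary conditions, the negative
norm functional

$$J_{-1}(\mathbf{u}, \omega, p; \mathbf{f}) = \|\nabla \times \omega + \nabla p - \mathbf{f}\|_{-1}^2 + \|\nabla \times \mathbf{u} - \omega\|_0^2 + \|\nabla \cdot \mathbf{u}\|_0^2 \tag{2.7}$$

is norm equivalent on $X_{-1} = [H_0^1(\Omega)]^2 \times L^2(\Omega) \times L_0^2(\Omega)$. A least-squares principle for (2.6) is
to minimize (2.7) over $X_{-1}$.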

The negative norm (1.1) admits the following characterization [3].

Theorem 2.1. For any $u \in H^{-1}(\Omega)$,

$$\|u\|_{-1} = \|(-\Delta)^{-1/2} u\|_0. \tag{2.8}$$

This theorem reveals that the negative norm is not easily computable because it requires
inversion of the Laplace operator. Therefore, to obtain a practical LSFEM it must be approximated.
The diagonal operator

$$(-\Delta)^{-1/2} \mapsto hI \tag{2.9}$$

gives a simple approximation of the negative norm that is sufficiently accurate for our purposes
[3]. Using (2.9) the first term of (2.7) is approximated by

$$\|\nabla \times \omega + \nabla p - \mathbf{f}\|_{-1}^2 \approx h^2 \|\nabla \times \omega + \nabla p - \mathbf{f}\|_0^2. \tag{2.10}$$

We arrive at the following discrete version of (2.7):

$$J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}) = h^2 \|\nabla \times \omega^h + \nabla p^h - \mathbf{f}\|_0^2 + \|\nabla \times \mathbf{u}^h - \omega^h\|_0^2 + \|\nabla \cdot \mathbf{u}^h\|_0^2 \tag{2.11}$$

where $(\mathbf{u}^h, \omega^h, p^h) \in X_r^h = [R_r(\Omega) \cap H_0^1(\Omega)]^2 \times (R_{r-1}(\Omega) \cap H^1(\Omega)) \times (R_{r-1} \cap L_0^2(\Omega))$, $r > 1$. We
refer to (2.11) as the weighted $L^2$ method$^1$ since it is composed of the squared $L^2$ norms
of the residuals of each equation, with the first residual scaled by a mesh-dependent weight. In what follows we
restrict attention to the lowest-order admissible space, i.e., $r = 2$.

One can show that (2.11) is a well-posed and optimally convergent formulation [3]. In
particular, the following theorem holds [3].

Theorem 2.2. Let $(\mathbf{u}^h, \omega^h, p^h) \in X_r^h$ be a solution to (2.11), and $(\mathbf{u}, \omega, p) \in X$ be the exact
solution to (2.6), such that $\mathbf{u} \in \mathbf{H}^3(\Omega)$, $\omega \in H^2(\Omega)$ and $p \in H^2(\Omega)$. There exists a constant
$C > 0$ such that

$$\|\mathbf{u} - \mathbf{u}^h\|_1 + \|\omega - \omega^h\|_0 + \|p - p^h\|_0 \le C h^2 \left( \|\mathbf{u}\|_3 + \|\omega\|_2 + \|p\|_2 \right). \tag{2.12}$$


1  For  simplicity,  in  our  implementation  of  the  weighted  method  the  one  dimensional  nullspace  of  the  pressure 
is  eliminated  by  setting  the  pressure  on  the  boundary  to  zero  at  one  point  instead  of  enforcing  (2.3).  These  two 
approaches  to  handling  the  one  dimensional  nullspace  of  the  pressure  are  equivalent;  however,  the  choice  affects  the 
convergence  of  the  iterative  method  used  to  solve  the  system.  A  comparison  can  be  found  in  [3]. 
From Theorem 2.2, we see that using a quadratic/biquadratic approximation for the velocity
and linear/bilinear approximations for the vorticity and pressure results in optimal convergence
rates. Nonetheless, for simplicity we work with the equal-order version of the finite
element space

$$X_2^h = (R_2(\Omega) \cap H_0^1(\Omega))^2 \times (R_2(\Omega) \cap H^1(\Omega)) \times (R_2 \cap L_0^2(\Omega)). \tag{2.13}$$

We use (2.11) as a basis for our discontinuous velocity LSFEM.

2.2. Mass conservation in the weighted $L^2$ least-squares method. Theorem 2.2 asserts
that the weighted $L^2$ method is optimally accurate for all sufficiently smooth exact solutions
of the Stokes equations. This means that asymptotically $\|\nabla \cdot \mathbf{u}^h\|_0 \to 0$ as $h \to 0$.
However, on a given fixed mesh size this term cannot be guaranteed to be as small as may
be required, nor could its convergence to zero be assured for insufficiently smooth velocity
fields. In this section we show that these concerns are not unfounded and that in some cases
mass loss in the weighted least-squares method can be significant.

To this end we consider two standard test problems: the backward-facing step flow,
shown in Fig. 2.1, and a channel flow past a cylinder, shown in Fig. 2.2. For the backward-facing
step problem the domain is the rectangle $[0, 10] \times [0, 1]$ with a reentrant corner at
$(2, 0.5)$. The velocity boundary condition is specified as follows. On the inflow ($x = 0$) and
outflow ($x = 10$) walls

$$\mathbf{u}_{in} = \begin{pmatrix} 8(y - 0.5)(1 - y) \\ 0 \end{pmatrix} \quad \text{and} \quad \mathbf{u}_{out} = \begin{pmatrix} y(1 - y) \\ 0 \end{pmatrix}, \tag{2.14}$$

respectively. Along all other portions of the boundary $\mathbf{u}_{wall} = 0$ is enforced.

The geometry of the second test problem is given by the rectangle $[-1, 3] \times [-1, 1]$ with
a disk of radius $r = 0.6$ centered at $(0, 0)$ removed from the domain. The velocity boundary
condition for this problem is set as follows. On the inflow ($x = -1$), outflow ($x = 3$), top
($y = 1$) and bottom ($y = -1$) sides

$$\mathbf{u}_{in} = \mathbf{u}_{out} = \mathbf{u}_{wall} = \begin{pmatrix} (1 - y)(1 + y) \\ 0 \end{pmatrix} \tag{2.15}$$

and on the surface of the “cylinder” $\mathbf{u}_{cyl} = 0$. Therefore, velocity is set to zero on all parts of
the boundary except for the inflow and the outflow portions of $\partial\Omega$.

Note that in both test problems specification of the velocity boundary condition is compatible
with $\nabla \cdot \mathbf{u} = 0$ because fluid enters and leaves the domain only through the inflow and
the outflow boundaries, respectively, and


(2.16) 


To assess the mass conservation properties of the least-squares methods considered in this
report we measure the total mass flow across several vertical surfaces connecting the top and
the bottom sides of the computational domains. The lines marked by “S” in Figures 2.1-2.2
show two typical examples of such surfaces for the two test problems. Because the greatest
mass loss for the backward-facing step is expected near the reentrant corner we always place
one of the surfaces at $x = 2$. For the second test problem we always measure the flow across
the surface at $x = 0$ where the domain narrows due to the cylindrical cutout.

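Because for both test problems velocity is zero on all parts of the boundary except $\Gamma_{in}$
and $\Gamma_{out}$, from the divergence theorem it follows that

$$\int_{\Gamma_{in}} \mathbf{u} \cdot \mathbf{n}_{in} \, d\ell = \int_S \mathbf{u} \cdot \mathbf{n}_S \, d\ell. \tag{2.17}$$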
Fig. 2.1. Geometry of the first test problem: backward-facing step.

Fig. 2.2. Geometry of the second test problem: flow past a cylinder.


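Therefore, mass conservation can be quantified by the percent mass loss across the surface $S$,
defined as follows:

$$m_{loss} = \frac{\int_{\Gamma_{in}} \mathbf{u} \cdot \mathbf{n}_{in} \, d\ell - \int_S \mathbf{u} \cdot \mathbf{n}_S \, d\ell}{\int_{\Gamma_{in}} \mathbf{u} \cdot \mathbf{n}_{in} \, d\ell}. \tag{2.18}$$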
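
The quantities entering (2.18) are easy to evaluate for the prescribed boundary data. The sketch below is a minimal illustration, not the authors' code. It integrates the inflow and outflow profiles (2.14) of the backward-facing step with a trapezoid rule, which also confirms that the two fluxes balance, and evaluates the mass loss as a percentage. The helper names, the factor of 100 and the sample flux across $S$ are assumptions made for the example.

```python
import numpy as np

def trapezoid(values, points):
    """Composite trapezoid rule for samples `values` at nodes `points`."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(points)))

def percent_mass_loss(inflow_flux, surface_flux):
    """Mass loss (2.18), expressed here as a percentage of the inflow flux."""
    return 100.0 * (inflow_flux - surface_flux) / inflow_flux

# Backward-facing step: inflow on x = 0, 0.5 <= y <= 1; outflow on x = 10, 0 <= y <= 1.
y_in = np.linspace(0.5, 1.0, 2001)
y_out = np.linspace(0.0, 1.0, 2001)
inflow = trapezoid(8.0 * (y_in - 0.5) * (1.0 - y_in), y_in)
outflow = trapezoid(y_out * (1.0 - y_out), y_out)
print(inflow, outflow)   # both close to 1/6, so the boundary data conserve mass

# Illustrative only: a discrete solution carrying half of the inflow flux across S.
print(percent_mass_loss(inflow, 0.5 * inflow))
```

In practice the flux across $S$ would be computed from the finite element velocity along the surface; the value used above is a placeholder.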


To assess mass conservation properties of the weighted $L^2$ formulation we solve the two
test problems using the following modified version of the weighted $L^2$ least-squares functional

$$J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}^h) = h^2 \|\nabla \times \omega^h + \nabla p^h - \mathbf{f}^h\|_0^2 + \|\nabla \times \mathbf{u}^h - \omega^h\|_0^2 + \rho \|\nabla \cdot \mathbf{u}^h\|_0^2 \tag{2.19}$$

implemented using the equal order space (2.13). This modification has been proposed in
[10] as a way to improve mass conservation in least-squares methods. By increasing $\rho$ we
increase the relative importance of the residual of the continuity equation, thereby promoting
mass conservation. In our study we use $\rho = 1$, $\rho = 10$ and $\rho = 20$.

Our results are summarized in Figure 2.3. We see that for $\rho = 1$ the least-squares solution
of the backward-facing step problem experiences severe mass loss in excess of 50% of the
total mass near the reentrant corner. Increasing $\rho$ does improve conservation; however, mass
loss remains unacceptably high even for $\rho = 20$. We note that a significant increase of $\rho$ is
not recommended as this will reduce the accuracy of the other terms in the functional and
compromise, e.g., conservation of momentum. Indeed, increasing the weight of a single
term in the least-squares functional in effect decreases the weight of the other terms.
Thus, by choosing a large weight $\rho$ to promote mass conservation, we are effectively
demoting conservation of momentum. The mass loss in the second test problem is not as
severe but still significant at 7%. In this case, setting $\rho = 20$ helps to bring down the loss of
mass across the narrowing to about 2%.

Fig. 2.3. Percent mass loss of (2.19) for the backward-facing step (left panel) and the flow past a cylinder (right
panel) test problems.
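
The trade-off described above can be reproduced on a tiny algebraic model. In the sketch below the matrices `A1` and `A2` are hypothetical stand-ins for the momentum/vorticity residuals and the continuity residual, respectively; they are not taken from any finite element discretization. Minimizing $\|A_1 x - b_1\|^2 + w \|A_2 x - b_2\|^2$ for growing weight $w$ drives the second residual down while the first one grows.

```python
import numpy as np

def weighted_least_squares(A1, b1, A2, b2, weight):
    """Minimize ||A1 x - b1||^2 + weight * ||A2 x - b2||^2 via normal equations."""
    normal_matrix = A1.T @ A1 + weight * (A2.T @ A2)
    rhs = A1.T @ b1 + weight * (A2.T @ b2)
    x = np.linalg.solve(normal_matrix, rhs)
    return x, np.linalg.norm(A1 @ x - b1), np.linalg.norm(A2 @ x - b2)

# Two mutually inconsistent residual blocks (illustrative data).
A1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b1 = np.array([1.0, 1.0, 0.0])
A2 = np.array([[1.0, -1.0], [2.0, 1.0]])
b2 = np.array([1.0, 0.0])

for weight in (1.0, 10.0, 20.0):
    _, r1, r2 = weighted_least_squares(A1, b1, A2, b2, weight)
    print(f"weight={weight:5.1f}  residual 1={r1:.4f}  residual 2={r2:.4f}")
```

The weights 1, 10 and 20 mirror the values used in the study, but the residual magnitudes of this toy problem have no relation to the mass losses reported in Figure 2.3.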

Remark 1. Exact element-wise mass conservation with $C^0$ elements has been achieved
in the so-called restricted least-squares method [9]. In the restricted LSFEM mass conservation
on each element is added as an explicit constraint leading to the following constrained
minimization problem:

$$\min_{X_r^h} J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}) \quad \text{subject to} \quad \int_K \nabla \cdot \mathbf{u}^h \, dK = 0 \quad \forall K \in \mathcal{K}. \tag{2.20}$$

Although (2.20) returns a solution with exact element-wise mass conservation, the system
is typically solved using Lagrange multipliers and results in a saddle-point system which
negates the advantages of using least-squares in the first place. The constrained optimization
problem can also be solved by a penalty approach, in which case one is led back to a formulation
similar to (2.19) with a very large $\rho$. Because the penalty must be strong enough
to enforce the constraint accurately, the penalty formulation of (2.20) suffers from the same
disadvantages as (2.19).

In the next section we explore an alternative approach to improve mass conservation in
least-squares methods based on allowing discontinuous velocity spaces in the formulation.

3. Discontinuous velocity least-squares finite element method. Numerical results in
the last section show that least-squares methods with $C^0$ elements can suffer from severe
mass loss which in some cases may exceed 50% of the total mass. Furthermore, the remedies
available to counter this loss are not satisfactory: weighting strongly the continuity equation
residual as in (2.19) reduces conservation of momentum, while using the restricted formulation
(2.20) leads to a saddle-point problem and negates the advantages of least-squares.

The option of using div-conforming elements to achieve exact mass conservation in least-squares
methods has been explored in [4]. However, the resulting mimetic LSFEM requires
non-standard boundary conditions for the Stokes equations, and its extension to the practically
important velocity boundary condition is not clear.

Consequently, in order to improve mass conservation in LSFEMs for the Stokes equations
with the velocity boundary condition we propose to employ a discontinuous finite element approximation
of the velocity, while retaining $C^0$ elements for the rest of the variables. In so
doing we achieve two objectives. First, we keep the growth of the degrees of freedom to a
minimum, compared to a fully discontinuous formulation. Second, relaxation of the interelement
continuity of the velocity space allows a greater flexibility in the choice of the local
finite element approximation of that variable. In particular, it becomes possible to consider
locally divergence-free spaces which would have been impractical if the global velocity space
had to be $H^1$-conforming.

Following these ideas we develop a discontinuous velocity least-squares finite element
method based on the well-posed formulation (2.11) in two stages. At the first stage we allow
discontinuous finite elements for the velocity in (2.11), i.e., we set

$$[X_r^h] = ([R_r])^2 \times R_{r-1} \times R_{r-1}. \tag{3.1}$$

This necessitates some changes in the least-squares functional, namely, the last two terms
have to be broken into element sums to deal with the loss of conformity in the velocity space:

$$J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}^h) = h^2 \|\nabla \times \omega^h + \nabla p^h - \mathbf{f}^h\|_0^2 + \sum_{K \in \mathcal{K}} \left( \|\nabla \times \mathbf{u}^h - \omega^h\|_{0,K}^2 + \|\nabla \cdot \mathbf{u}^h\|_{0,K}^2 \right). \tag{3.2}$$

Furthermore, to obtain a well-posed formulation with a unique solution, we need to recover
some of the $H^1$-conformity qualities of the velocity. Therefore, constraints on the jumps in
normal and tangential components of the velocity are introduced.

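Recall that $\mathcal{E}(\Omega)$ is the set of all interior edges in the mesh. It is easy to see that the
weighted $L^2$ least-squares method (2.11) is equivalent to the following constrained minimization
problem

$$\min_{[X_r^h]} J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}^h) \quad \text{subject to} \quad \int_{e_i} [\mathbf{u}^h \cdot \mathbf{n}_i] \, d\ell = 0 \ \text{ and } \ \int_{e_i} [\mathbf{u}^h \times \mathbf{n}_i] \, d\ell = 0 \quad \forall e_i \in \mathcal{E}(\Omega). \tag{3.3}$$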


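The constrained system can be solved by Lagrange multipliers in which case the resulting
minimization problem becomes

$$\min_{[X_r^h]} \max_{\mathbb{R}^{|\mathcal{E}|}} \; J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}^h) - \sum_{e_i \in \mathcal{E}(\Omega)} \lambda_{1,i} \int_{e_i} [\mathbf{u}^h \cdot \mathbf{n}_i] \, d\ell - \sum_{e_i \in \mathcal{E}(\Omega)} \lambda_{2,i} \int_{e_i} [\mathbf{u}^h \times \mathbf{n}_i] \, d\ell. \tag{3.4}$$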


Of  course,  similar  to  (2.20),  this  formulation  is  a  saddle-point  system  that  gives  rise  to  an 
indefinite  algebraic  system. 
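
The difference between a multiplier formulation and a penalty formulation can be illustrated on a small equality-constrained quadratic problem. The sketch below is a generic linear-algebra analogue, not a discretization of (3.4): the matrix `H`, the vector `g` and the single constraint `B` are made-up data, and `penalty_solve` is a hypothetical helper. The multiplier (KKT) matrix has eigenvalues of both signs, whereas the penalized matrix stays symmetric positive definite at the price of a constraint that is only approximately satisfied and a condition number that grows with the penalty.

```python
import numpy as np

# Minimize 0.5 * x^T H x - g^T x subject to B x = 0 (illustrative data).
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])          # symmetric positive definite
g = np.array([1.0, 2.0, 3.0])
B = np.array([[1.0, 1.0, 1.0]])

# Lagrange multipliers: saddle-point (KKT) system, symmetric but indefinite.
kkt = np.block([[H, B.T], [B, np.zeros((1, 1))]])
kkt_eigs = np.linalg.eigvalsh(kkt)
x_exact = np.linalg.solve(kkt, np.concatenate([g, [0.0]]))[:3]
print("KKT eigenvalues:", kkt_eigs)

def penalty_solve(p):
    """Penalty form: minimize 0.5 x^T H x - g^T x + 0.5 * p * |B x|^2."""
    M = H + p * (B.T @ B)                # remains symmetric positive definite
    x = np.linalg.solve(M, g)
    return x, abs((B @ x)[0]), np.linalg.cond(M)

for p in (1.0, 1e3, 1e6):
    x, violation, cond = penalty_solve(p)
    print(f"p={p:8.0e}  |B x|={violation:.2e}  cond={cond:.2e}")
```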

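Instead of using Lagrange multipliers we will encourage $H^1$ conformity by a penalty
approach, i.e., by adding residuals of the interelement jumps to the least-squares functional. This
gives rise to the following discontinuous velocity functional:

$$\begin{aligned} J_{-1}^h(\mathbf{u}^h, \omega^h, p^h; \mathbf{f}^h) = {} & h^2 \|\nabla \times \omega^h + \nabla p^h - \mathbf{f}^h\|_0^2 + \sum_{K \in \mathcal{K}} \left( \|\nabla \times \mathbf{u}^h - \omega^h\|_{0,K}^2 + \|\nabla \cdot \mathbf{u}^h\|_{0,K}^2 \right) \\ & + h^{-1} \sum_{e_i \in \mathcal{E}(\Omega)} \left( \alpha_1 \int_{e_i} [\mathbf{u}^h \cdot \mathbf{n}_i]^2 \, d\ell + \alpha_2 \int_{e_i} [\mathbf{u}^h \times \mathbf{n}_i]^2 \, d\ell \right) \end{aligned} \tag{3.5}$$

where $\alpha_1, \alpha_2 > 0$ are penalty parameters. The values of these constants can be used to adjust
the relative importance of normal vs. tangential continuity.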

Based on analogies with div-conforming elements, one could argue that strengthening
the normal continuity of the velocity field should lead to improved mass conservation in
the finite element solution of (3.5). If this were the case, then the discontinuous velocity
formulation (3.5) with $\alpha_1 \gg \alpha_2$ should be able to take care of the mass losses in our two test
problems. To test this hypothesis we implement (3.5) using the equal-order discontinuous
velocity, continuous vorticity and pressure finite element space

$$[X_2^h] = ([R_2])^2 \times R_2 \times R_2 \tag{3.6}$$

and solve the two test problems with two different choices for $\alpha_1$ and $\alpha_2$. The first choice is
to set $\alpha_1 = \alpha_2 = 100$, in which case we expect$^2$ to see mass losses comparable to that in the
$C^0$ formulation. The second set of weights, $\alpha_1 = 100$, $\alpha_2 = 0.01$, emphasizes normal over
tangential continuity. If our hypothesis were correct, this set of weights would lead to much
improved mass conservation.

Unfortunately, the results shown in Fig. 3.1 refute our seemingly logical hypothesis. The
left panel in the figure shows that for the backward-facing step problem the second weight
combination does lead to a significant improvement in the mass conservation by reducing the
mass loss from over 50% to just a bit over 3%. However, for the flow past a cylinder the
situation is completely reversed. Now the choice $\alpha_1 = 100$, $\alpha_2 = 0.01$ leads to a significant
deterioration of the mass conservation and increases mass loss from 7% in the $C^0$ formulation
to nearly 90%! These results clearly indicate that the discontinuous velocity formulation (3.5)
cannot be reliably counted on to always reduce the mass loss with the same choice of weights,
i.e., its mass conservation properties are problem dependent. This is an undesirable feature
that we shall deal with at the second stage of the formulation of our new method.

Fig. 3.1. Percent mass loss in the discontinuous velocity least-squares method (3.5) for the backward-facing
step (left panel) and the flow past a cylinder (right panel) test problems. Green line corresponds to $\alpha_1 = \alpha_2 = 100$,
red line corresponds to $\alpha_1 = 100$, $\alpha_2 = 0.01$ and the blue line gives the reference mass loss by the $C^0$ least-squares
method (2.11). The legend values are read as $\alpha_1$, $\alpha_2$ with (2.11) as reference labeled (C0).

2 This is because in the limit as $\alpha_1 \to \infty$ and $\alpha_2 \to \infty$, (3.5) recovers the $C^0$ solution of the weighted $L^2$ LSFEM.

To motivate this stage we note that while discontinuous velocity does allow for improvements
in mass conservation, the least-squares formulation (3.5) does not enforce exact mass
conservation on each element. At the same time, considering that the velocity space is not
subject to any interelement continuity, it is obvious that we have a greater flexibility in choosing
the velocity representation on each element than in the $C^0$ setting. In particular, we can
take advantage of this flexibility by choosing the velocity to be pointwise divergence free on
each element by setting

$$\mathbf{u}^h|_K = \nabla \times \psi^h|_K \quad \forall K \in \mathcal{K}, \tag{3.7}$$

where $\psi^h \in [R_r]$ is a discontinuous stream function. Therefore, at the second stage we replace
the velocity field in (3.5) with the field defined in (3.7). Note that when defining $\mathbf{u}^h$ in this
way, $\nabla \cdot \mathbf{u}^h = 0$ is automatically satisfied and hence the residual of the continuity equation can
be dropped from the least-squares functional. However, a term that penalizes the jump of the
stream function must be added to the functional. Furthermore, because velocity is eliminated,
the velocity boundary condition must be implemented through the stream function. It is easy
to see that $\mathbf{n} \cdot \nabla \times \psi$ involves only tangential derivatives of $\psi$. Therefore, a Dirichlet boundary
condition on the stream function specifies the normal component of the velocity. We specify
the tangential component of the velocity weakly by adding another least-squares term to our
functional. As a result, we arrive at the following discontinuous stream function-vorticity-pressure
(SVP) least-squares functional:


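$$
\begin{aligned}
J^h_{-1}(\psi^h, \omega^h, p^h; f^h) ={}& h^2 \|\nabla \times \omega^h + \nabla p^h - f^h\|_0^2 + \sum_{K \in \mathcal{K}_h} \|\nabla \times \nabla \times \psi^h - \omega^h\|_{0,K}^2 \\
&+ h^{-1} \sum_{e_i \in \mathcal{E}(\Omega)} \left( \alpha_1 \int_{e_i} [(\nabla \times \psi^h) \cdot n_i]^2 \, d\ell + \alpha_2 \int_{e_i} [(\nabla \times \psi^h) \times n_i]^2 \, d\ell \right) \\
&+ h^{-1} \sum_{e_i \in \mathcal{E}(\Gamma)} \int_{e_i} |(\nabla \times \psi^h) \times n_i|^2 \, d\ell + h^{-3} \sum_{e_i \in \mathcal{E}(\Omega)} \int_{e_i} [\psi^h]^2 \, d\ell.
\end{aligned} \tag{3.8}
$$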


The weight for the last term of (3.8) is determined by a scaling argument assuming that
$\psi \in H^2$ and hence its trace is in $H^{3/2}$. The jump of the stream function is necessary for
elements not adjacent to the boundary since constraining only $[n \cdot \nabla \times \psi]$ and $[n \times \nabla \times \psi]$
specifies $\psi$ only up to a constant. Once (3.8) is solved, the velocity is recovered through
formula (3.7), i.e., on each element

$$u^h|_K = \nabla \times \psi^h|_K. \tag{3.9}$$

We can view the discontinuous SVP formulation (3.8) as a special case of the discontinuous
velocity formulation (3.5) with a specific choice of a divergence-free basis. We choose
to define this basis through a stream function as in (3.7) primarily because of the simplicity of
this choice; however, it should be clear that our approach can easily accommodate any choice
of a divergence-free velocity basis.
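
The divergence-free property behind (3.7) can be checked numerically. The following is a minimal sketch, not part of the implementation described in this paper: the stream function `psi`, the sample points, and the finite-difference step are hypothetical choices made only for illustration. It forms the two-dimensional curl of a scalar, $(\psi_y, -\psi_x)$, and confirms that its divergence vanishes up to rounding.

```python
import math

def psi(x, y):
    # Hypothetical smooth stream function, chosen only for illustration.
    return math.sin(x) * math.cos(2.0 * y) + x * x * y

def curl_scalar(f, x, y, h=1e-3):
    """2D curl of a scalar field, (df/dy, -df/dx), by central differences."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    return df_dy, -df_dx

def velocity(x, y):
    # Velocity recovered from the stream function, as in (3.7).
    return curl_scalar(psi, x, y)

def divergence(u, x, y, h=1e-3):
    """Divergence of a vector field u(x, y) -> (u1, u2), by central differences."""
    du1_dx = (u(x + h, y)[0] - u(x - h, y)[0]) / (2.0 * h)
    du2_dy = (u(x, y + h)[1] - u(x, y - h)[1]) / (2.0 * h)
    return du1_dx + du2_dy

sample_points = [(0.1, 0.2), (0.7, -0.3), (1.5, 0.9)]
max_div = max(abs(divergence(velocity, x, y)) for x, y in sample_points)
print("largest |div u| over the sample points:", max_div)
```

The mixed second differences cancel term by term, so the printed value reflects only floating-point rounding; the same cancellation holds for any element-wise stream function, which is why no continuity residual is needed in (3.8).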

It is worth pointing out that the discrete least-squares method for Darcy flow in two
dimensions [7] uses a discontinuous finite element space for the flux defined in a similar
manner by

$$\mathbf{V}^h = \nabla(S^h_D) \oplus \nabla \times (S^h_N), \tag{3.10}$$

where $S^h_D$ and $S^h_N$ are standard $C^0$ finite element spaces constrained by zero on the Dirichlet
and Neumann portions of the boundary, respectively. The key difference is that our approach deals with the
discontinuity of the approximating space by including appropriate jump terms and retaining
the original differential operators, whereas [7] retains the global inner products but switches
to weak discrete differential operators defined using integration by parts.
The use of stream functions is not a novel idea and has been applied to the Stokes equations
[11]; however, most research on the SVP formulation uses finite differences
because of the presence of the second derivative. In the discontinuous framework,
it is not necessary to construct a global $H^2(\Omega)$-conforming finite element space, as $\nabla \times \nabla \times \psi$
is only needed locally.
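
To make the role of the second derivative concrete: in two dimensions the curl of the curl of a scalar reduces to the negative Laplacian, $\nabla \times \nabla \times \psi = -\psi_{xx} - \psi_{yy}$. The sketch below checks this numerically; the function `psi`, the evaluation point, and the step size are hypothetical and serve only as an illustration.

```python
import math

def psi(x, y):
    # Hypothetical stream function, used only to exercise the identity.
    return math.exp(0.5 * x) * math.sin(y) + x * y * y

def curl_of_scalar(f, x, y, h):
    """2D curl of a scalar: (df/dy, -df/dx), by central differences."""
    return ((f(x, y + h) - f(x, y - h)) / (2.0 * h),
            -(f(x + h, y) - f(x - h, y)) / (2.0 * h))

def curl_of_vector(v, x, y, h):
    """2D scalar curl of a vector field v = (v1, v2): dv2/dx - dv1/dy."""
    dv2_dx = (v(x + h, y)[1] - v(x - h, y)[1]) / (2.0 * h)
    dv1_dy = (v(x, y + h)[0] - v(x, y - h)[0]) / (2.0 * h)
    return dv2_dx - dv1_dy

def negative_laplacian_exact(x, y):
    # psi_xx = 0.25 e^{x/2} sin(y),  psi_yy = -e^{x/2} sin(y) + 2x
    e = math.exp(0.5 * x)
    psi_xx = 0.25 * e * math.sin(y)
    psi_yy = -e * math.sin(y) + 2.0 * x
    return -psi_xx - psi_yy

h = 1e-3
x0, y0 = 0.3, 0.8
curl_curl = curl_of_vector(lambda x, y: curl_of_scalar(psi, x, y, h), x0, y0, h)
error = abs(curl_curl - negative_laplacian_exact(x0, y0))
print("curl curl psi =", curl_curl, " error vs. -Laplacian:", error)
```

Because only second derivatives of $\psi$ on a single element enter, the quantity can be evaluated element by element without any global smoothness of the stream function.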

4. Implementation. All of the above methods are implemented using Intrepid [5] and
solved using the KLU solver of Amesos [17]; both are packages of Trilinos [12]. Intrepid is
a local framework that implements basis functions for $H^1$, $H(\mathrm{curl})$, and $H(\mathrm{div})$. Since our
formulations are discontinuous, it suffices to choose basis functions in $H^1$ on each element
and implement the jump terms.

It is easy to convert least-squares functionals to an implementable weak form by setting
the first directional derivative to zero. For example, the weak form that minimizes (3.5) is to
find $(u^h, \omega^h, p^h) \in X^h$ such that

$$
\begin{aligned}
&(\nabla \times \omega^h + \nabla p^h, \nabla \times s^h + \nabla q^h)_0 + \sum_{K \in \mathcal{K}_h} (\nabla \times u^h - \omega^h, \nabla \times v^h - s^h)_{0,K} \\
&\quad + \sum_{e_i \in \mathcal{E}(\Omega)} \left( \int_{e_i} [u^h \cdot n][v^h \cdot n] \, d\ell + \int_{e_i} [u^h \times n][v^h \times n] \, d\ell \right) = (f, \nabla \times s^h + \nabla q^h)_0
\end{aligned} \tag{4.1}
$$

for all $(v, s, q) \in [H^1(\Omega)]^2 \times L^2(\Omega) \times L^2(\Omega)$. The weak form for all other least-squares
functionals can be obtained in a similar way.

Since $u^h = (u^h_x, u^h_y)$ is a vector-valued function, each component has separate degrees of
freedom, and in the cases where $u^h$ is discontinuous, each element has its own set of degrees
of freedom for $u^h_x$ and $u^h_y$.

4.1. Transformations. All basis functions are defined on the reference element; thus,
it is necessary to use the correct transformation to the physical element. Intrepid provides
implementations for the most common transformations; however, nonstandard transformations
such as $\nabla \times \omega$ (the curl of a scalar field in two dimensions) and, for the SVP formulation,
$\nabla \times \nabla \times \psi$ are not straightforward. Because $\nabla \times \omega$ is the curl of a scalar function, it is an
element of $H(\mathrm{div})$. Thus we use HDIVtransformVALUE for curls of scalar functions. In two
dimensions, using the identity

$$\nabla \times \nabla \times \psi = -\psi_{xx} - \psi_{yy}, \tag{4.2}$$

it follows that on a reference element we can compute $\nabla \times \nabla \times \psi$ by using OPERATOR_D2,
which computes all second derivatives of $\psi$. Once $\nabla \times \nabla \times \psi$ is computed on the reference
element, it is necessary to transform it to the physical element. This is done by noting that
$\nabla \times \psi \in H(\mathrm{div})$ and that $\nabla \times v$, for a vector-valued function $v$, is a rotated divergence; hence
the HDIVtransformDIV transformation is used.

4.2. Boundary conditions. In all of our tests, we use the velocity boundary condition
$u|_{\partial\Omega} = u_D$ specified on the entire boundary. We set the pressure to 0 at a single point
on the boundary. Since the basis in Intrepid is interpolatory, the boundary conditions are set
strongly by specifying

$$u(x_i) = u_D(x_i) \quad \forall x_i \in \partial\Omega, \qquad p(x_0) = 0. \tag{4.3}$$

This is done by defining a vector $u_0$ that is zero for all degrees of freedom corresponding to
interior points and equal to $u_D$ at the boundary degrees of freedom. We then set

$$b \leftarrow b - A u_0. \tag{4.4}$$

Each row and column of $A$ corresponding to a boundary degree of freedom is then set to zero and
the diagonal entry is set to 1.
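
The strong imposition of boundary values described above (lifting with $b \leftarrow b - A u_0$, then zeroing boundary rows and columns and placing 1 on the diagonal) can be sketched on a small dense system. This is an illustration only, not the Intrepid/Amesos code used for the experiments: the helper names `apply_dirichlet` and `solve_dense`, the 4x4 test matrix, and the final step of writing the boundary value into `b` are assumptions introduced here.

```python
def apply_dirichlet(A, b, boundary_values):
    """Impose u[i] = g for every (i, g) in boundary_values on A u = b.

    A is a dense square matrix (list of lists) and b a list; modified copies
    are returned. Steps: b <- b - A u0, zero the boundary rows and columns,
    put 1 on the diagonal. Writing g into b[i] afterwards is an assumption
    made here so that the solve returns the prescribed boundary value.
    """
    n = len(b)
    A = [row[:] for row in A]
    u0 = [0.0] * n
    for i, g in boundary_values.items():
        u0[i] = g
    b = [b[i] - sum(A[i][j] * u0[j] for j in range(n)) for i in range(n)]
    for i, g in boundary_values.items():
        for j in range(n):
            A[i][j] = 0.0
            A[j][i] = 0.0
        A[i][i] = 1.0
        b[i] = g
    return A, b

def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Hypothetical 1D Laplace-type matrix with boundary dofs 0 and 3.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
b = [0.0, 0.0, 0.0, 0.0]
A_bc, b_bc = apply_dirichlet(A, b, {0: 1.0, 3: 4.0})
u = solve_dense(A_bc, b_bc)
print(u)
```

Zeroing the columns as well as the rows keeps the modified matrix symmetric, which is the reason for lifting the boundary data into the right-hand side first.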
Fig. 5.1. Comparison of mass loss for the backward-facing step (left panel) and the flow past a cylinder
(right panel) test problems. The blue line represents the weighted $L^2$ formulation (2.19); the green line is the SVP formulation (3.8).


5. Numerical examples. Because the discontinuous SVP formulation (3.8) does not
include the velocity, some care must be exercised in setting the velocity boundary condition
for our two test problems. In the case of the backward step, recall that the boundary condition
is given by (2.14). Because on $\Gamma_{in}$ and $\Gamma_{out}$ the velocity is only a function of $y$, the $u_1$ component
is integrated to obtain an equivalent boundary condition on the stream function:

$$\psi_{in} = -\frac{8}{3}y^3 + 6y^2 - 4y + C_1, \qquad \psi_{out} = \frac{y^2}{2} - \frac{y^3}{3} + C_2. \tag{5.1}$$

The constants $C_1$ and $C_2$ are chosen so that $\psi_{in}(0.5) = \psi_{out}(0)$ and $\psi_{in}(1) = \psi_{out}(1)$. The
stream function on the top and bottom walls is then set to the constants $\psi_{in}(1)$ and $\psi_{in}(0.5)$, respectively.
Likewise, the equivalent stream function boundary conditions for the second domain, Figure
2.2, with velocity boundary conditions (2.15) are


$$\psi_{in} = \psi_{out} = \psi_{wall} = y. \tag{5.2}$$


However, setting the boundary conditions in this way enforces only the normal component of
the velocity. In our test cases, the tangential velocity vanishes on all boundaries. We set the
tangential velocity weakly by including

$$\|n \times \nabla \times \psi\|^2_{1/2,\Gamma} \approx h^{-1} \|n \times \nabla \times \psi\|^2_{0,\Gamma} = h^{-1} \int_\Gamma |n \times \nabla \times \psi|^2 \, d\ell \tag{5.3}$$

in the least-squares functional (3.8).
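
The integration that leads to the stream-function boundary data for the backward step can be checked with a few lines of code. The sketch below is illustrative only: the inflow and outflow profiles `u_in` and `u_out` are assumptions (parabolic profiles on $0.5 \le y \le 1$ and $0 \le y \le 1$ chosen to be consistent with the cubic polynomials of (5.1)), and the choice $C_2 = 0$ is arbitrary.

```python
def u_in(y):
    # Assumed parabolic inflow profile on 0.5 <= y <= 1 (hypothetical).
    return 8.0 * (y - 0.5) * (1.0 - y)

def u_out(y):
    # Assumed parabolic outflow profile on 0 <= y <= 1 (hypothetical).
    return y * (1.0 - y)

def psi_in(y, c1):
    # Antiderivative of u_in plus an integration constant.
    return -8.0 / 3.0 * y**3 + 6.0 * y**2 - 4.0 * y + c1

def psi_out(y, c2):
    # Antiderivative of u_out plus an integration constant.
    return y**2 / 2.0 - y**3 / 3.0 + c2

# Fix C2 = 0 (arbitrary) and choose C1 so that psi_in(0.5) = psi_out(0).
c2 = 0.0
c1 = psi_out(0.0, c2) - psi_in(0.5, 0.0)

# With equal inflow and outflow flux the top-wall values match automatically.
top_mismatch = abs(psi_in(1.0, c1) - psi_out(1.0, c2))

# The derivative of psi_in reproduces the assumed velocity profile.
h = 1e-5
y0 = 0.8
dpsi_dy = (psi_in(y0 + h, c1) - psi_in(y0 - h, c1)) / (2.0 * h)
profile_error = abs(dpsi_dy - u_in(y0))
print(c1, top_mismatch, profile_error)
```

The difference of the stream function between the two walls equals the volumetric flux through the channel, so the two matching conditions on $C_1$ and $C_2$ are compatible exactly when the inflow and outflow fluxes agree.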

The resulting mass loss for (3.8) is summarized in Figure 5.1, which shows that mass
conservation is significantly improved. Indeed, for the backward-facing step, the maximum
mass loss is less than 1.09%, with most of the mass loss concentrated at the reentrant corner. On
the rest of the domain, mass is essentially conserved over any closed subdomain. This
is a dramatic improvement compared to (2.19). For the channel flow with cylinder cutout,
the mass conservation is also improved, with a slight mass gain of 0.34% at the opening of
the cylinder. In contrast to (3.5), the stream function formulation is able to achieve better
mass conservation than (2.19); recall that no matter how $\alpha_1$ and $\alpha_2$ were chosen, the mass
conservation of (3.5) could not improve past that of the weighted $L^2$ formulation.

The velocity fields of each formulation are visualized in Figures 5.2 and 5.5, and in the
case of the backward-facing step, the mass loss for (3.5) is clearly visible. For the SVP
formulation, the propagation of the parabolic profile of the inflow is clearly seen throughout
the domain. In the case of the weighted $L^2$ formulation, the parabolic profile diminishes,
reflecting the 50% mass loss. From the additional figures (5.3-5.8), it can also be seen that
the pressure and vorticity are more accurately captured with the stream function formulation.


Fig. 5.2. Velocity plot of $C^0$ (2.19) (top) and SVP (3.8) (bottom) for the backward-facing step.


Fig. 5.3. Pressure plot of $C^0$ (2.19) (top) and SVP (3.8) (bottom) for the backward-facing step.


We considered using divergence-free bases on each element because of the necessity to
choose the correct weights $\alpha_1$ and $\alpha_2$ in (3.5), for which it was not possible to find a single
set of weights that is optimal for all test cases. In the case of (3.8), only one set of weights is
used, and it proved to be effective.

6. Conclusion. In this report we have formulated new discontinuous velocity LSFEMs
for the Stokes equations as a means to improve mass conservation. These new methods were
compared with a provably optimal, norm-equivalent weighted $L^2$ least-squares formulation.
The direct discontinuous velocity formulation was found not to be robust because different
weights were required depending on the problem. As a result, a local divergence-free basis for
the velocity was introduced through a stream function. The stream function approach
proved to be robust, as a single set of weights, derived from Sobolev theory, yielded a
solution that is almost entirely mass conservative.

Fig. 5.4. Vorticity plot of $C^0$ (2.19) (top) and SVP (3.8) (bottom) for the backward-facing step.

Fig. 5.5. Velocity plot of $C^0$ (2.19) (top) and SVP (3.8) (bottom) for the cylinder channel.

The  proposed  approach  is  very  flexible  and  can  be  easily  applied  to  other  LSFEMs  based 
on  the  VVP  or  other  first-order  Stokes  systems.  For  example,  it  is  trivial  to  extend  (3.8)  to 
a  discrete  negative-norm  method,  or  to  a  method  which  uses  velocity  gradient,  velocity  and 
pressure  as  dependent  variables. 
Fig. 5.7. Vorticity plot of $C^0$ (2.19) (top) and SVP (3.8) (bottom) for the cylinder channel.

Fig. 5.8. Stream function for backward step (top) and cylinder channel (bottom).


Future work in the area includes theoretical studies of the well-posedness of discontinuous
formulations and an implementation of (3.8) using cubic elements for the stream function.
This allows the velocity, the curl of the stream function, to be quadratic, satisfying the minimal
degree requirement of the parent $C^0$ least-squares formulation.
APPLICATION OF A DISCONTINUOUS PETROV-GALERKIN METHOD TO THE STOKES EQUATIONS

NATHAN V. ROBERTS*, DENIS RIDZAL§, PAVEL B. BOCHEV¶, LESZEK D. DEMKOWICZ‖, KARA J.
PETERSON** AND CHRISTOPHER M. SIEFERT**

Abstract. The discontinuous Petrov-Galerkin finite element method proposed by L. Demkowicz and J. Gopalakrishnan
[5, 6] guarantees the optimality of the solution in what they call the energy norm. An important choice that
must be made in the application of the method is the definition of the inner product on the test space. In this paper,
we apply the DPG method to the Stokes problem in two dimensions, analyzing it to determine appropriate inner
products, and perform a series of numerical experiments.

1. Introduction. Recently, L. Demkowicz and J. Gopalakrishnan have proposed a new
class of discontinuous Petrov-Galerkin (DPG) methods [5, 6, 7, 10, 3], which compute test
functions that are adapted to the problem of interest to produce stable discretization schemes.
An important choice that must be made in the application of the method is the definition of the
inner product on the test space. In this paper, we apply the method to the Stokes problem in
two dimensions, analyzing it to determine appropriate inner products, and perform numerical
experiments to test these inner products.

Whereas traditional Galerkin methods use the same space for test and trial functions, Petrov-Galerkin
methods allow the test and trial spaces to differ. The DPG approach computes test
functions that are optimal, in a sense that we make precise in Section 2. One consequence
of this choice of test functions is that the stiffness matrix for a continuous, weakly coercive
variational formulation is symmetric (Hermitian, for complex-valued problems) and positive
definite. Of course, the determination of test functions is an extra step compared with traditional
methods; it is important that these can be determined cheaply. By using discontinuous
Galerkin (DG) formulations, DPG achieves this, reducing the computation of the test functions
to a local problem. Our method bears some resemblance to the MDG method [9] in that
a local problem is solved on each element. The key difference is that in MDG
the local problem is a restriction of the original equations, whereas in DPG the local problem
is implied by the selected test space inner product. Furthermore, in MDG the local problem
is used to express DG degrees of freedom in terms of continuous degrees of freedom, i.e., to
effect static condensation on the element.

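Our primary goal is the application of the method to the Stokes problem in two dimensions.
The strong form of the problem is

$$-2\mu \nabla \cdot \varepsilon + \nabla p = f \quad \text{in } \Omega, \tag{1.1}$$

$$\nabla \cdot u = 0 \quad \text{in } \Omega, \tag{1.2}$$

$$u = g_D \quad \text{on } \partial\Omega, \tag{1.3}$$

where $\Omega \subset \mathbb{R}^2$, $\mu$ is viscosity, $\varepsilon = \nabla_{sym} u$ is strain, $p$ is pressure, $u$ is velocity, and $f$ is a vector
forcing function.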

The paper is structured as follows. In Section 2, we give an introduction to the basic
features of the DPG method. In Section 3, we derive the weak formulation of the problem.
In Section 4, we motivate the choice of inner product on the test space with reference to
an argument for the continuity of the bilinear form. In Section 5, we present the numerical
results. We conclude in Section 6.

2. DPG Method. Here, we sketch some of the main features of the DPG method. For
details, we refer the reader to a series of papers by Demkowicz et al., in particular the second
ICES Report [6], from which most of this section is derived. We begin with theoretical
definitions and results, and then describe the approach to practical realization. Consider the
abstract variational boundary-value problem:

$$\text{Find } u \in U : \quad b(u, v) = l(v) \quad \forall v \in V. \tag{2.1}$$

We take $U$ and $V$ to be real Hilbert spaces. We assume $b(\cdot, \cdot)$ is continuous, i.e.

$$|b(u, v)| \leq M \|u\|_U \|v\|_V, \tag{2.2}$$

for some real $M$. We assume also that $b(\cdot, \cdot)$ is weakly coercive, that is

$$\inf_{\|u\|_U = 1} \; \sup_{\|v\|_V = 1} b(u, v) \geq \gamma, \tag{2.3}$$

for some $\gamma > 0$. If we additionally assume that

$$\{v \in V : b(u, v) = 0 \;\; \forall u \in U\} = \{0\}, \tag{2.4}$$

then it is well known that the problem (2.1) has a unique solution provided that $l \in V'$, the
dual of $V$.

2.1. Energy Norm. We define an alternate norm, called the energy norm, on the trial
space $U$ by

$$\|u\|_E = \sup_{\|v\|_V = 1} b(u, v). \tag{2.5}$$

This norm is the one in which the optimality is guaranteed by the selection of optimal test
functions. It is an equivalent norm to the standard norm on $U$, i.e.

$$\gamma \|u\|_U \leq \|u\|_E \leq M \|u\|_U \quad \forall u \in U. \tag{2.6}$$

2.2. Optimal Test Functions. We are now prepared to give a definition of the optimal
test functions. Define a map $T : U \to V$ from the trial space to the test space by: For $u \in U$,
define $Tu$, the optimal test function corresponding to $u$, as the unique solution to

$$(Tu, v)_V = b(u, v) \quad \forall v \in V.$$

By the Riesz representation theorem, $T$ is well-defined. Note that

$$\|u\|_E = \sup_{\|v\|_V = 1} b(u, v) = \sup_{\|v\|_V = 1} (Tu, v)_V = \frac{1}{\|Tu\|_V} (Tu, Tu)_V = \|Tu\|_V.$$

Thus the energy norm is generated by the inner product on $V$, i.e.

$$(u, u)_E = (Tu, Tu)_V. \tag{2.7}$$

In practice, we approximate $T$ by a discrete operator $T_n$, described in Section 2.4.
2.3. Optimal Test Space for $U_n$. Take a finite-dimensional trial space $U_n \subset U$. Define
the optimal test space for $U_n$ as $V_n = \mathrm{span}\{T e_j : j = 1, \ldots, n\}$, where the $e_j$ form a basis for
$U_n$.

Solve the discrete problem

$$\text{Find } u_n \in U_n : \quad b(u_n, v) = l(v) \quad \forall v \in V_n. \tag{2.8}$$

Then the error is the best approximation error in the energy norm,

$$\|u - u_n\|_E = \inf_{w_n \in U_n} \|u - w_n\|_E, \tag{2.9}$$

and this is the sense in which the test space is optimal.

2.4. Practical Realization. The method involves two steps: first, find the optimal test
functions; second, use the optimal test functions to solve the discrete problem (2.8). The optimal
test functions are not in general polynomials. In practice, we approximate them with
an "enriched" polynomial space, that is, a space of polynomials of slightly higher degree than the
trial space. This is done to provide a higher-fidelity approximation to the continuous space
of optimal test functions. The best choice for the amount of "enrichment" is determined
experimentally for each problem.

In general, we apply the following procedure:

1. Given a boundary value problem, develop mesh-dependent $b(\cdot, \cdot)$ with test space
$V$ that allows inter-element discontinuities (hence Discontinuous Petrov-Galerkin).
We develop this in Section 3.

2. Choose trial space $U_n$ (in particular the norm of interest in $U_n$), and the inner product
on $V$, which will be motivated by the choice of trial space. We detail this process for
the Stokes problem in Section 4.

3. Compute optimal test functions. Approximate $T$ by $T_n : U_n \to V_n \subset V$. We use an
enriched space of piecewise polynomials for $V_n$. Defining $t_j = T_n e_j$, we solve

$$(t_j, \hat{e}_i)_V = b(e_j, \hat{e}_i)$$

for $t_j$, where the $\hat{e}_i$ form the basis for $V_n$.

4. Use the optimal test functions to solve the problem on $U_n \times V_n$. We note that the
stiffness matrix here is symmetric positive definite (Hermitian, for a complex-valued
problem),

$$b(e_j, t_i) = (T_n e_j, t_i)_V = (T_n e_j, T_n e_i)_V = (T_n e_i, T_n e_j)_V = (T_n e_i, t_j)_V = b(e_i, t_j).$$

Also, note that this means that we may compute the stiffness matrix in terms of the
inner product on the test space $V$, without explicit recourse to the bilinear form.
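
Steps 3 and 4 can be illustrated in finite dimensions. The following is a minimal sketch under stated assumptions, not the implementation used in this paper: the Gram matrix `G` of the test inner product, the matrix `B` with entries $b(e_j, \hat{e}_i)$, and the helper `solve_dense` are hypothetical stand-ins chosen only to show the algebra. Each optimal test function is obtained by solving a system with `G`, and the resulting stiffness matrix $K_{ij} = b(e_j, t_i)$ is symmetric positive definite even though `B` itself is rectangular and has no symmetry.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Hypothetical data: 3 enriched test basis functions, 2 trial basis functions.
G = [[2.0, 1.0, 0.0],      # G[i][k] = (e_hat_k, e_hat_i)_V, symmetric positive definite
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
B = [[1.0, 0.0],           # B[i][j] = b(e_j, e_hat_i)
     [2.0, 1.0],
     [0.0, -1.0]]
m, n = len(B), len(B[0])

# Step 3: coefficients of t_j in the test basis solve G t_j = B[:, j].
T = [solve_dense(G, [B[k][j] for k in range(m)]) for j in range(n)]

# Step 4: K[i][j] = b(e_j, t_i) = sum_k T[i][k] * B[k][j].
K = [[sum(T[i][k] * B[k][j] for k in range(m)) for j in range(n)] for i in range(n)]

det_K = K[0][0] * K[1][1] - K[0][1] * K[1][0]
print(K, det_K)
```

In matrix terms the sketch computes $K = B^T G^{-1} B$, which is why the stiffness matrix inherits symmetry and positive definiteness from the test inner product whenever `B` has full column rank.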

3. Stokes Formulation. Our general approach to variational formulation in DPG is as
follows. First, rewrite the strong form of the problem as a system of first-order partial differential
equations. Then, multiply by test functions and integrate by parts, moving all derivatives
to the test functions. We thus arrive at the ultra-weak form of the problem, a formulation in
which all solution variables are in $L^2$.

Starting with the strong formulation defined in equations (1.1)-(1.3), introduce stress $\sigma$
and vorticity $\omega$ by

$$\sigma = 2\mu\varepsilon - pI, \qquad \omega = \frac{1}{2}\left(\nabla u - \nabla u^T\right),$$
so that equation (1.1) becomes simply $-\nabla \cdot \sigma = f$. We also have

$$\varepsilon = \frac{1}{2\mu}(\sigma + pI).$$

Since $\varepsilon = \nabla_{sym} u = \nabla u - \omega$, the entire system is

$$
\begin{aligned}
\frac{1}{2\mu}(\sigma + pI) - \nabla u + \omega &= 0 && \text{in } \Omega, \\
-\nabla \cdot \sigma &= f && \text{in } \Omega, \\
\nabla \cdot u &= 0 && \text{in } \Omega, \\
u &= g_D && \text{on } \partial\Omega.
\end{aligned}
$$

Note that the antisymmetric part of the first equation recovers the definition of $\omega$, so that it
need not enter the system separately. Define scalar $\omega = \omega_{21} = \frac{1}{2}(u_{1,2} - u_{2,1})$. Our strong
formulation is


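$$
\begin{aligned}
\frac{1}{2\mu}\begin{pmatrix} \sigma_{11} + p \\ \sigma_{12} \end{pmatrix} - \nabla u_1 + \begin{pmatrix} 0 \\ \omega \end{pmatrix} &= 0 && \text{in } \Omega, \\
\frac{1}{2\mu}\begin{pmatrix} \sigma_{12} \\ \sigma_{22} + p \end{pmatrix} - \nabla u_2 - \begin{pmatrix} \omega \\ 0 \end{pmatrix} &= 0 && \text{in } \Omega, \\
-\nabla \cdot \sigma_1 &= f_1 && \text{in } \Omega, \\
-\nabla \cdot \sigma_2 &= f_2 && \text{in } \Omega, \\
\nabla \cdot u &= 0 && \text{in } \Omega, \\
u &= g_D && \text{on } \partial\Omega.
\end{aligned}
$$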


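Multiplying the first two equations by vector test functions $q_i$ and the following three by
scalar test functions $v_i$, and integrating by parts over an element $K$, we obtain

$$
\begin{aligned}
\int_K \left[ \frac{1}{2\mu}\begin{pmatrix} \sigma_{11} + p \\ \sigma_{12} \end{pmatrix} + \begin{pmatrix} 0 \\ \omega \end{pmatrix} \right] \cdot q_1 + \int_K u_1 \, \nabla \cdot q_1 - \int_{\partial K} \hat{u}_1 \, q_1 \cdot \nu &= 0, \\
\int_K \left[ \frac{1}{2\mu}\begin{pmatrix} \sigma_{12} \\ \sigma_{22} + p \end{pmatrix} - \begin{pmatrix} \omega \\ 0 \end{pmatrix} \right] \cdot q_2 + \int_K u_2 \, \nabla \cdot q_2 - \int_{\partial K} \hat{u}_2 \, q_2 \cdot \nu &= 0, \\
\int_K \sigma_1 \cdot \nabla v_1 - \int_{\partial K} v_1 \, \hat{\sigma}_1 \cdot \nu &= \int_K f_1 v_1, \\
\int_K \sigma_2 \cdot \nabla v_2 - \int_{\partial K} v_2 \, \hat{\sigma}_2 \cdot \nu &= \int_K f_2 v_2, \\
-\int_K u \cdot \nabla v_3 + \int_{\partial K} v_3 \, \hat{u} \cdot \nu &= 0,
\end{aligned}
$$

where the "hatted" variables ($\hat{u}_1$, e.g.) are the fluxes introduced by relaxing the continuity
requirement at element boundaries. These differ from the numerical fluxes that appear in other
DG methods, in that they are not constructed a priori, but simply enter the variational problem
as additional unknowns. We solve for them at the same time as we solve for the rest of the
unknowns. As in other DG methods, the fluxes will approach the corresponding "unhatted"
solution variables as the latter approach the exact solution.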
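4. Inner Product Determination. As discussed in Section 2, the optimality proof depends
on the continuity and weak coercivity of the bilinear form. In this section, we use
continuity to motivate a particular choice of inner product on $V$.

We seek to show that $|b(U, v)| \leq M \|U\|_U \|v\|_V$, for some constant $M$, for spaces $U$ and $V$
to be specified. The norm on $U$ should be specified in such a way that minimizing the error
in this norm will produce the results we want. We define

$$\|U\|_U^2 = \sum_{i=1}^{7} \left( \frac{\|u_i\|_{L^2(\Omega)}}{\alpha_i} \right)^2 + \sum_{i=1}^{2} \left( \frac{\|F_i\|_{H^{1/2}(\partial\Omega)}}{\hat{\alpha}_i} \right)^2 + \sum_{i=3}^{5} \left( \frac{\|F_i\|_{H^{-1/2}(\partial\Omega)}}{\hat{\alpha}_i} \right)^2,$$

where $u_1$ and $u_2$ are as above, $u_3 = \sigma_{11}$, $u_4 = \sigma_{12} = \sigma_{21}$, $u_5 = \sigma_{22}$, $u_6 = \omega$, and $u_7 = p$; $F_i$ is
the flux corresponding to the $i$th equation (that is, $F_1 = \hat{u}_1$, $F_2 = \hat{u}_2$, $F_3 = \hat{\sigma}_1 \cdot \nu$, $F_4 = \hat{\sigma}_2 \cdot \nu$,
and $F_5 = \hat{u} \cdot \nu$); and the $\alpha_i$ and $\hat{\alpha}_i$ are positive weights that allow us to emphasize specific
components. The reason we use the $H^{1/2}$ norm on the fluxes corresponding to $H(\mathrm{div})$ test
functions is that $q \in H(\mathrm{div}) \Rightarrow \mathrm{tr}(q) \in H^{-1/2}$. Thus for $\int_{\partial\Omega} F_i \, q_i \cdot \nu$ to make sense mathematically,
we require $F_i \in H^{1/2}$. A similar argument establishes that the fluxes corresponding
to $H^1$ test functions should lie in $H^{-1/2}$. Let us consider the first equation of our bilinear form,

$$
\begin{aligned}
b_1(U, v) &= \int_\Omega \left[ \frac{1}{2\mu}\begin{pmatrix} \sigma_{11} + p \\ \sigma_{12} \end{pmatrix} + \begin{pmatrix} 0 \\ \omega \end{pmatrix} \right] \cdot q_1 + \int_\Omega u_1 \, \nabla \cdot q_1 - \int_{\partial\Omega} \hat{u}_1 \, q_1 \cdot \nu \\
&= \left( \frac{\sigma_{11} + p}{2\mu}, q_{11} \right) + \left( \frac{\sigma_{12}}{2\mu} + \omega, q_{12} \right) + (u_1, \nabla \cdot q_1) - (\hat{u}_1, q_1 \cdot \nu)_{\partial\Omega}.
\end{aligned}
$$

Now, by the Cauchy-Schwarz inequality, we have

$$
\begin{aligned}
|b_1(U, v)| \leq{}& \frac{1}{2\mu}\left( \|\sigma_{11}\|_0 + \|p\|_0 \right) \|q_{11}\|_0 + \left( \frac{1}{2\mu}\|\sigma_{21}\|_0 + \|\omega\|_0 \right) \|q_{12}\|_0 \\
&+ \|u_1\|_0 \|\nabla \cdot q_1\|_0 + \|\hat{u}_1\|_{H^{1/2}(\partial\Omega)} \|q_1 \cdot \nu\|_{H^{-1/2}(\partial\Omega)}.
\end{aligned} \tag{4.1}
$$

Applying the finite-dimensional Cauchy-Schwarz inequality, we have

$$
\begin{aligned}
|b_1(U, v)| \leq{}& \left( \frac{1}{4\mu^2}\left( \|\sigma_{11}\|_0 + \|p\|_0 \right)^2 + \left( \frac{1}{2\mu}\|\sigma_{21}\|_0 + \|\omega\|_0 \right)^2 + \|u_1\|_0^2 + \|\hat{u}_1\|^2_{H^{1/2}(\partial\Omega)} \right)^{1/2} \\
&\cdot \left( \|q_{11}\|_0^2 + \|q_{12}\|_0^2 + \|\nabla \cdot q_1\|_0^2 + \|q_1 \cdot \nu\|^2_{H^{-1/2}(\partial\Omega)} \right)^{1/2}.
\end{aligned}
$$

Note that for a particular choice of weights, namely $\alpha_3 = \alpha_4 = \alpha_6 = \alpha_7 = \alpha_1 = \hat{\alpha}_1 = 1$,
we then immediately have

$$|b_1(U, v)| \leq \|U\|_U \left( \|q_{11}\|_0^2 + \|q_{12}\|_0^2 + \|\nabla \cdot q_1\|_0^2 + \|q_1 \cdot \nu\|^2_{H^{-1/2}(\partial\Omega)} \right)^{1/2},$$

motivating a norm

$$\|q_1\|_{V_1} = \left( \|q_{11}\|_0^2 + \|q_{12}\|_0^2 + \|\nabla \cdot q_1\|_0^2 + \|q_1 \cdot \nu\|^2_{H^{-1/2}(\partial\Omega)} \right)^{1/2}.$$

However, one of our purposes in defining a weighted norm $\|U\|_U$ was to gain some control over
scale equivalence in computation of the test space inner product, and the argument above, in
providing a tight bound, has separated the weights from the test space terms. Instead, let us
return to the inequality (4.1), and note that by the definition of $\|\cdot\|_U$, $\|u_i\|_{L^2(\Omega)} \leq \alpha_i \|U\|_U$;
similarly, $\|F_i\|_{H^{1/2}(\partial\Omega)} \leq \hat{\alpha}_i \|U\|_U$ for $i = 1, 2$ and $\|F_i\|_{H^{-1/2}(\partial\Omega)} \leq \hat{\alpha}_i \|U\|_U$ for $i = 3, 4, 5$. Thus
we have

$$|b_1(U, v)| \leq \left( \frac{\alpha_3 + \alpha_7}{2\mu} \|q_{11}\|_0 + \left( \frac{\alpha_4}{2\mu} + \alpha_6 \right) \|q_{12}\|_0 + \alpha_1 \|\nabla \cdot q_1\|_0 + \hat{\alpha}_1 \|q_1 \cdot \nu\|_{H^{-1/2}(\partial\Omega)} \right) \|U\|_U,$$

motivating the norm

$$\|q_1\|_{V_1} = \frac{\alpha_3 + \alpha_7}{2\mu} \|q_{11}\|_0 + \left( \frac{\alpha_4}{2\mu} + \alpha_6 \right) \|q_{12}\|_0 + \alpha_1 \|\nabla \cdot q_1\|_0 + \hat{\alpha}_1 \|q_1 \cdot \nu\|_{H^{-1/2}(\partial\Omega)}.$$

For convenience, we implement a similar norm given by

$$\|q_1\|^2_{V_1} \overset{\mathrm{def}}{=} \left( \frac{\alpha_3 + \alpha_7}{2\mu} \right)^2 \|q_{11}\|_0^2 + \left( \frac{\alpha_4}{2\mu} + \alpha_6 \right)^2 \|q_{12}\|_0^2 + \alpha_1^2 \|\nabla \cdot q_1\|_0^2 + \hat{\alpha}_1^2 \|q_1 \cdot \nu\|^2_{L^2(\partial\Omega)}.$$

Similarly, for $q_2$ we implement

$$\|q_2\|^2_{V_2} \overset{\mathrm{def}}{=} \left( \frac{\alpha_4}{2\mu} + \alpha_6 \right)^2 \|q_{21}\|_0^2 + \left( \frac{\alpha_5 + \alpha_7}{2\mu} \right)^2 \|q_{22}\|_0^2 + \alpha_2^2 \|\nabla \cdot q_2\|_0^2 + \hat{\alpha}_2^2 \|q_2 \cdot \nu\|^2_{L^2(\partial\Omega)}.$$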

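For $b_3(U, v)$, we have

$$b_3(U, v) = \int_\Omega \sigma_1 \cdot \nabla v_1 - \int_{\partial\Omega} v_1 \, \hat{\sigma}_1 \cdot \nu = \int_\Omega \begin{pmatrix} u_3 \\ u_4 \end{pmatrix} \cdot \nabla v_1 - \int_{\partial\Omega} v_1 \, \hat{\sigma}_1 \cdot \nu.$$

Thus

$$|b_3(U, v)| \leq \|U\|_U \left( \alpha_3 \left\| \frac{\partial v_1}{\partial x} \right\|_0 + \alpha_4 \left\| \frac{\partial v_1}{\partial y} \right\|_0 + \hat{\alpha}_3 \|v_1\|_{H^{1/2}(\partial\Omega)} \right),$$

and the norm we implement for $v_1$ is

$$\|v_1\|^2_{V_3} \overset{\mathrm{def}}{=} \alpha_3^2 \left\| \frac{\partial v_1}{\partial x} \right\|_0^2 + \alpha_4^2 \left\| \frac{\partial v_1}{\partial y} \right\|_0^2 + \hat{\alpha}_3^2 \|v_1\|^2_{L^2(\partial\Omega)}.$$

Similarly,

$$\|v_2\|^2_{V_4} \overset{\mathrm{def}}{=} \alpha_4^2 \left\| \frac{\partial v_2}{\partial x} \right\|_0^2 + \alpha_5^2 \left\| \frac{\partial v_2}{\partial y} \right\|_0^2 + \hat{\alpha}_4^2 \|v_2\|^2_{L^2(\partial\Omega)}$$

and

$$\|v_3\|^2_{V_5} \overset{\mathrm{def}}{=} \alpha_1^2 \left\| \frac{\partial v_3}{\partial x} \right\|_0^2 + \alpha_2^2 \left\| \frac{\partial v_3}{\partial y} \right\|_0^2 + \hat{\alpha}_5^2 \|v_3\|^2_{L^2(\partial\Omega)}.$$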
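Now, we can define a general norm on the test space by

$$
\begin{aligned}
\|(q_1, \ldots, q_L; v_1, \ldots, v_M)\|^2 ={}& \sum_{i=1}^{L} \left[ \int_\Omega \left( (a_i \nabla \cdot q_i)^2 + (b_{i1} q_{i1})^2 + (b_{i2} q_{i2})^2 \right) + \int_{\partial\Omega} (e_i \, q_i \cdot \nu)^2 \right] \\
&+ \sum_{i=1}^{M} \left[ \int_\Omega \left( \left( c_{i1} \frac{\partial v_i}{\partial x} \right)^2 + \left( c_{i2} \frac{\partial v_i}{\partial y} \right)^2 + (d_i v_i)^2 \right) + \int_{\partial\Omega} (f_i v_i)^2 \right].
\end{aligned} \tag{4.2}
$$

Based on the analysis above, we require

$$
\begin{aligned}
& a_1 = \alpha_1, \quad a_2 = \alpha_2, \\
& b_{11} = \frac{\alpha_3 + \alpha_7}{2\mu}, \quad b_{12} = \frac{\alpha_4}{2\mu} + \alpha_6, \quad b_{21} = \frac{\alpha_4}{2\mu} + \alpha_6, \quad b_{22} = \frac{\alpha_5 + \alpha_7}{2\mu}, \\
& c_{11} = \alpha_3, \quad c_{12} = \alpha_4, \quad c_{21} = \alpha_4, \quad c_{22} = \alpha_5, \quad c_{31} = \alpha_1, \quad c_{32} = \alpha_2, \\
& d_i = 0, \\
& e_1 = \hat{\alpha}_1, \quad e_2 = \hat{\alpha}_2, \\
& f_1 = \hat{\alpha}_3, \quad f_2 = \hat{\alpha}_4, \quad f_3 = \hat{\alpha}_5.
\end{aligned} \tag{4.3}
$$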


4.1. Choice of $\alpha$ values. How do we determine appropriate weights $\alpha_i$ and $\hat{\alpha}_i$ for the
norm of $U$? Our choice is motivated by considerations of norm equivalence arising, for
instance, in the least-squares finite element literature, see [1, Sec. 4.5]. For simplicity, we
apply a similar guideline, which we call scale equivalence. Let us consider a mesh with
elements of size $h$. In a least-squares method, one would motivate the choice of weights by
examining the factors of $h$ entering the stiffness matrix through derivatives in the bilinear
form. One would then select weights so that each term of the bilinear form had the same
$h$-factor, thereby ensuring that no single term dominates the least-squares functional as $h \to 0$.

Recall that in DPG the optimality is expressed in terms of the energy norm in equation
(2.9), which in turn is defined by the inner product on $V$ in equation (2.7). As in least-squares
methods, there is an underlying optimization principle (equations (2.9) and (2.5)), and thus it
makes sense to have all terms in the test space inner product equally weighted in the discrete
setting. In this section, therefore, we aim to determine weights $\alpha_i$ and $\hat{\alpha}_i$ that will allow this.

Computing the optimal test functions involves the solution of a problem of the form

$$(t_j, \hat{e}_i)_V = b(e_j, \hat{e}_i),$$

where the $\hat{e}_i$ form the basis for the enriched polynomial space $V_n$ used to represent the test
functions, and $t_j$ is the optimal test function corresponding to $e_j \in U$. Thus the matrix for
determining the optimal test functions is generated by computing inner products $(\hat{e}_i, \hat{e}_j)_V$. The
goal is to keep the summands entering this matrix of the same order of magnitude in $h$.

We  assume  a  partition  of  Q  into  quadrilateral  elements.  Since  the  various  components 
(e.g.  q\  and  #2)  of  the  test  function  do  not  interact,  we  can  examine  each  separately.  Suppose 
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that  the  element  has  dimensions  (hi,  h2)  and  q\  =  (")•  Then 

(qi,qi)v  =  J  [a\(y2  +  2xy  +  x2)  +  (b2n  +  b2u)x2y2)  +  ^  e]  -  vj 

=  o[a\(hih\  +  h\h\  +  h\h2)  +  (b2u  +  b\2)h\h\  +  e\(h\h\  +  h\h\)) 

Clearly,  no  choice  of  weights  will  make  all  summands  of  the  same  order  in  both  h\  and  h 2; 
the  best  we  can  do  is  to  make  the  h\  and  orders  of  each  summand  differ  by  no  more  than 
2,  and  make  the  sum  of  the  h\  and  orders  the  same  across  all  summands.  This  can  be 
accomplished  by  setting  a2  =  h\h2,b2n  =  b2n  -  1,  and  e\  =  V^i^2- 

The  computation  with  q 2  is  identical.  Now,  consider  v\  =  xy.  We  have 

(vi,vi)v=  f  (c2ny2  +  c\2x2  +  d\x2y2)  +  f  f2x2y2 
JK  JdK 

=  O  (c2nhih32  +  c\2h\h2  +  (b2u  +  d\)h\h\  +  .f2(h]h\  +  h\h\j) 

Again,  we  cannot  choose  the  coefficients  to  get  each  summand  to  have  the  same  order  in  both 
h\  and  h^,  but  here  at  least  the  c\\  and  c n  coefficients  can  be  chosen  so  that  their  respective 
terms  match  precisely.  We  would  like  to  have  c2n  -  h2v  c2n  =  h^d2  =  1,  and  f2  =  ^fhihi, 
and  similarly  for  V2,  we’d  like  c\x  =  h2,  etc.  This  cannot  be  fully  achieved  because  of  the  way 
the  at  enter  the  inner  product;  specifically,  04  =  C12  =  C21.  Instead,  we  arrive  at  the  following 
weights: 


$$\alpha_1 = \alpha_2 = \sqrt{h_1 h_2} \qquad (4.4)$$

$$\alpha_3 = h_1 \qquad (4.5)$$

$$\alpha_4 = \sqrt{h_1 h_2} \qquad (4.6)$$

$$\alpha_5 = h_2 \qquad (4.7)$$

$$\alpha_6 = \alpha_7 = 1 \qquad (4.8)$$

$$\tilde{\alpha}_i = \sqrt[4]{h_1 h_2}. \qquad (4.9)$$

We detail numerical results for this inner product in Section 5.3. In Section 5.4, we present a
version where $\alpha_3 = \alpha_4 = \alpha_5 = 1$, with very similar results.

5. Numerical Results. We solve the Stokes problem on the domain $(-1, 1) \times (-1, 1)$,
with $\mu = 1$. We follow the choice of manufactured solution employed in a paper by Cockburn
et al. [4], in which they apply the LDG method to Stokes. We compare our convergence
rates to theirs; the $L^2$ error measurements are not strictly comparable because they employ a
triangular mesh, whereas we use a quadrilateral mesh. As stated in Section 2.4, the space for
$V$ is an “enriched” polynomial space. The numerical results presented below were produced
with test functions of degree one higher than that of the trial space.

Although the differences between our meshes and those in Cockburn et al. mean that
our error measurements are not strictly comparable, we would still expect to attain similar
rates of convergence, and for the $L^2$ error values in each component to be within an order
of magnitude or so. The rates of convergence we would expect in a velocity-stress-pressure
(VSP) least-squares context would be $k + 1$ for the velocity components $u_1$ and $u_2$, and $k$ for
the pressure $p$, where $k$ denotes the polynomial degree of the trial space [1, p. 269]. We have
yet to carry out the convergence analysis for DPG.
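As a small illustration of how an observed rate can be read off from two error values, the sketch below assumes that the mesh size $h$ is halved between refinements and that the error behaves like $C h^r$. The function name `observed_rate` and the sample errors are hypothetical; they are not taken from the tables in the appendix, and the sketch does not claim to reproduce how the tabulated rates were computed.

```python
import math


def observed_rate(err_coarse, err_fine, refinement=2.0):
    """Estimate the rate r in err ~ C * h**r from two error values.

    Assumes the mesh size h shrinks by the factor `refinement`
    between the coarse mesh and the fine mesh.
    """
    return math.log(err_coarse / err_fine) / math.log(refinement)


# Hypothetical errors that shrink by a factor of 8 per refinement, which is
# what a rate of k + 1 = 3 (velocity, quadratic elements) would look like.
errors = [1.0e-2, 1.25e-3, 1.5625e-4]
rates = [observed_rate(a, b) for a, b in zip(errors, errors[1:])]
```

Comparing such observed rates with $k + 1$ for the velocity and $k$ for the pressure is the kind of check the tables are meant to support.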
Following Cockburn et al., we use

$$u_1 = -e^x (y \cos y + \sin y)$$

$$u_2 = e^x y \sin y$$

$$p = 2 \mu e^x \sin y$$

as our manufactured solution. We impose the constraint $p(0, 0) = p_0$ in order to establish the
uniqueness of the solution.
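The following sketch checks this manufactured solution numerically with finite differences. It assumes the Stokes equations in the standard form $-\mu \Delta u + \nabla p = f$, $\nabla \cdot u = 0$ (an assumption about the scaling, since the precise statement appears earlier in the paper); the helper names and the step sizes are illustrative choices. Under that assumed form, the velocity is divergence free and the momentum residual vanishes, so the solution corresponds to zero body force.

```python
import math

MU = 1.0  # viscosity, matching the choice mu = 1 in the text


def u1(x, y):
    return -math.exp(x) * (y * math.cos(y) + math.sin(y))


def u2(x, y):
    return math.exp(x) * y * math.sin(y)


def p(x, y):
    return 2.0 * MU * math.exp(x) * math.sin(y)


def ddx(f, x, y, h=1e-4):
    """Central difference approximation of df/dx."""
    return (f(x + h, y) - f(x - h, y)) / (2.0 * h)


def ddy(f, x, y, h=1e-4):
    """Central difference approximation of df/dy."""
    return (f(x, y + h) - f(x, y - h)) / (2.0 * h)


def laplacian(f, x, y, h=1e-3):
    """Five-point finite difference approximation of the Laplacian."""
    return (f(x + h, y) + f(x - h, y) + f(x, y + h) + f(x, y - h)
            - 4.0 * f(x, y)) / (h * h)


def divergence(x, y):
    """Approximate div(u) = d(u1)/dx + d(u2)/dy."""
    return ddx(u1, x, y) + ddy(u2, x, y)


def momentum_residual(x, y):
    """Approximate -mu * Laplace(u) + grad(p), component by component."""
    r1 = -MU * laplacian(u1, x, y) + ddx(p, x, y)
    r2 = -MU * laplacian(u2, x, y) + ddy(p, x, y)
    return r1, r2
```

Evaluating `divergence` and `momentum_residual` at a few interior points of $(-1, 1) \times (-1, 1)$ should return values that are zero up to the finite difference error.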

We  try  four  inner  products,  the  first  two  as  baselines,  and  the  latter  two  as  suggested  by 
our  analysis.  As  expected,  the  choice  of  the  inner  product  makes  a  great  deal  of  difference  to 
the  rate  of  convergence. 

5.1. Generic Inner Product. As a baseline to show the importance of a good inner
product on the test function space, the results in this section are produced using a test space
inner product unrelated to our analysis. In the general form of the norm specified in equation
(4.2), let $a_i = b_{ij} = c_{ij} = d_i = e_i = f_i = 1$.

As can be seen in Table A.1, although our convergence rates generally start out near the
asymptotic rates predicted, they fall off quickly. The rate for pressure with quadratic elements
is particularly poor. The $L^2$ error values in $u$ are perhaps not too bad, within an order of
magnitude or so of the LDG results. However, the pressure error values are extremely poor,
off by up to three orders of magnitude.

As an experiment, we tried enriching the fluxes, using polynomials of degree $k + 1$ to
represent the solution fluxes; at the same time, we enriched the test function space further, using
polynomials of degree $k + 2$. As shown in Table A.2, this uniformly reduces the error,
particularly in the pressure, and improves the convergence rate observed in the pressure for quadratic
elements. Although cubic elements also saw uniformly reduced error, the convergence rates
observed were somewhat worse.

5.2. “All Ones” Inner Product. In Section 4.1, we derived weights for the inner
product so as to weight all terms in the determination of the optimal test functions equally. To
see the impact of our choice of those weights in relief, we try an inner product in which
$\alpha_i = \tilde{\alpha}_j = 1$. Compared with the generic inner product employed in the previous section, this
inner product takes account of the continuity argument.

As can be seen in Table B.1, with this choice of inner product, DPG performs slightly
better than with the generic inner product, but the rates of convergence in $p$ are quite poor,
especially for quadratic elements. For the 64 x 64 mesh, we even see regression in the $p$ error
compared with the 32 x 32 mesh, suggesting that some terms in the inner product dominate
as $h \to 0$, preventing convergence.

As in Section 5.1, we tried enriching the fluxes, using polynomials of degree $k + 1$ to
represent the solution fluxes; at the same time, we enriched the test function space further, using
polynomials of degree $k + 2$. As shown in Table B.2, this uniformly reduces the error,
particularly in the pressure, and improves the convergence rate observed in the pressure for quadratic
elements. Although cubic elements also saw uniformly reduced error, the convergence rates
observed for pressure were somewhat worse.

5.3. Mesh-Dependent Inner Product. In this inner product, we choose the $\alpha_i$ values
as derived in Section 4.1 and specified in equations (4.4)-(4.9). There, we aimed to achieve
scale equivalence in the determination of the optimal test functions while selecting an inner
product that allowed our argument for the continuity of $b(\cdot, \cdot)$ to remain intact.

As can be seen in Table C.1, with this inner product, we have far superior convergence
compared with either of the previous two inner products we have considered. Here, the
convergence rates for both velocity and pressure are very close to those predicted by theory,
and the $L^2$ error for $u$ is within a factor of 3 of the LDG results. However, our $L^2$ error in
pressure remains more than an order of magnitude worse than that which LDG was able to achieve.

We again tried enriching the flux space; the results are in Table C.2. Here, however,
the results are merely comparable to those in the enriched flux space experiments using the
previous two inner products. With cubic elements, enriching the fluxes made for slightly
worse results, perhaps due to round-off errors. All told, it appears that whatever is lost to
scale inequivalence in the previous two inner products is regained through higher-fidelity flux
approximation.

5.4. Mesh-Dependent Inner Product, Least-Squares Compromise $\alpha_3 = \alpha_4 = \alpha_5 = 1$.
Finally, we tried an inner product with weights just as in Section 5.3, except $\alpha_3 = \alpha_4 = \alpha_5 = 1$.
The rationale was that, in the norm of $U$, these weights are applied to the tensor $\sigma$, which
is a derivative, so the natural norm for $U$ in a least-squares approach (arising from concern
for scale equivalence of terms within the form $b(\cdot, \cdot)$) would have an extra factor of $h$ on $u_1$
and $u_2$ compared to the components of $\sigma$.

As can be seen in Table D.1, the results are almost identical to those reported in the
previous section. The only exception is the error in the pressure on a cubic, 64 x 64 mesh, for
which the present inner product produced an error about half that produced by the previous
inner product.

We  again  tried  enriching  the  flux  space;  the  results  are  in  Table  D.2.  The  enriched  fluxes 
again  give  us  lower  error,  but  as  we  refine,  the  advantage  this  gives  us  appears  to  become  less 
significant;  for  cubic  elements,  the  error  values  for  the  32  x  32  mesh  are  nearly  identical  to 
those  we  attained  for  this  inner  product  without  enriching  the  fluxes. 

6. Conclusions and Future Work. A robust application of the DPG method requires
a test space inner product that simultaneously allows a proof of coercivity and continuity of
the variational form and achieves scale equivalence both in the inner product matrix used to
compute the optimal test functions and in the stiffness matrix used to compute the solution.
In this paper, we have applied the DPG method to the Stokes problem, comparing several
inner product choices. The two inner products that did not account for scale equivalence
within the inner product matrix both demonstrated substantially poorer performance, while
those that did account for it achieved optimal convergence rates.

The fact that our $L^2$ errors in pressure were substantially worse than those for the LDG
method suggests that there may be a better choice of inner product; it may be that an
examination of coercivity (here absent) would suggest a better choice. The strategy we intend
to employ in the future is to use test norms motivated by examining the optimal test norm
studied in previous DPG efforts (see [10, Sec. 2]); this is a norm on the global test space for
which $\|\cdot\|_E = \|\cdot\|_U$.

Enriching the flux space erased most of the distinctions between our various inner
products, and greatly reduced the errors observed in the pressure. It appears that the benefit of
using a better inner product is that we can save the computational cost associated with the
enriched flux space!

In the future, we plan to investigate the $hp$-adaptive solution of Stokes equations using
DPG, which offers stability independent of discretization parameters. We also plan to use
DPG to solve Stokes on polygons and polyhedra.

The work presented here was completed using L. Demkowicz’s $hp$-adaptive code; we
are presently implementing a DPG framework using Intrepid [2] and Trilinos [8].
Appendix


A.  Numerical  Results:  Generic  Inner  Product. 

Table A.1

$L^2$ error and h-convergence rates for generic inner product selected without reference to continuity
argument, as defined in Section 5.1: comparison with LDG [4].

Quadratic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.6e-2   -       -        -       5.9e-1   -       -        -
4x4         5.1e-3   2.82    -        -       2.0e-1   1.58    -        -
8x8         8.1e-4   2.73    2.0e-4   -       1.7e-1   0.92    5.1e-4   -
16x16       1.6e-4   2.59    2.4e-5   3.06    9.6e-2   0.81    1.2e-4   2.09
32x32       3.9e-5   2.46    2.9e-6   3.05    5.1e-2   0.81    3.0e-5   2.04
64x64       9.8e-6   2.36    -        -       2.6e-2   0.83    -        -

Cubic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.7e-3   -       -        -       2.7e-1   -       -        -
4x4         1.7e-4   3.97    5.8e-5   -       2.8e-2   3.27    2.4e-4   -
8x8         1.2e-5   3.88    3.6e-6   4.01    3.8e-3   3.08    3.9e-5   2.62
16x16       1.0e-6   3.79    2.2e-7   4.02    5.1e-4   3.01    5.3e-6   2.75
32x32       1.0e-7   3.68    -        -       7.1e-5   2.96    -        -
64x64       1.3e-8   3.55    -        -       3.6e-5   2.67    -        -

Table A.2

$L^2$ error and h-convergence rates for generic inner product selected without reference to continuity
argument with enriched fluxes ($k_{\text{flux}} = k + 1$, $k_{\text{test}} = k + 2$), as defined in Section 5.1: comparison with
LDG [4].

Quadratic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.5e-2   -       -        -       7.6e-1   -       -        -
4x4         4.4e-3   3.01    -        -       1.4e-1   2.48    -        -
8x8         5.4e-4   3.02    2.0e-4   -       1.9e-2   2.67    5.1e-4   -
16x16       6.6e-5   3.03    2.4e-5   3.06    3.5e-3   2.61    1.2e-4   2.09
32x32       8.2e-6   3.02    2.9e-6   3.05    7.0e-4   2.54    3.0e-5   2.04

Cubic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.6e-3   -       -        -       3.2e-2   -       -        -
4x4         1.6e-4   3.96    5.8e-5   -       9.4e-3   1.78    2.4e-4   -
8x8         1.0e-5   4.00    3.6e-6   4.01    1.2e-3   2.35    3.9e-5   2.62
16x16       6.0e-7   4.02    2.2e-7   4.02    1.6e-4   2.58    5.3e-6   2.75
32x32       3.7e-8   4.02    -        -       3.3e-5   2.57    -        -
B. Numerical Results: “All Ones” Inner Product.

Table B.1

$L^2$ error and h-convergence rates for an inner product for which $\alpha_i = \tilde{\alpha}_j = 1$ (i.e. with weights
selected without concern for scale equivalence) as defined in Section 5.2: comparison with LDG [4].

Quadratic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.6e-2   -       -        -       5.6e-1   -       -        -
4x4         5.0e-3   2.84    -        -       1.7e-1   1.70    -        -
8x8         7.7e-4   2.77    2.0e-4   -       1.4e-1   1.00    5.1e-4   -
16x16       1.5e-4   2.64    2.4e-5   3.06    8.2e-2   0.87    1.2e-4   2.09
32x32       3.4e-5   2.51    2.9e-6   3.05    4.3e-2   0.85    3.0e-5   2.04
64x64       8.5e-6   2.40    -        -       2.2e-2   0.86    -        -

Cubic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.7e-3   -       -        -       2.7e-1   -       -        -
4x4         1.7e-4   3.99    5.8e-5   -       2.7e-2   3.30    2.4e-4   -
8x8         1.2e-5   3.92    3.6e-6   4.01    3.7e-3   3.09    3.9e-5   2.62
16x16       9.2e-7   3.84    2.2e-7   4.02    4.8e-4   3.03    5.3e-6   2.75
32x32       9.0e-8   3.73    -        -       6.3e-5   2.99    -        -
64x64       1.4e-8   3.55    -        -       2.8e-4   2.25    -        -

Table B.2

$L^2$ error and h-convergence rates for an inner product for which $\alpha_i = \tilde{\alpha}_j = 1$ (i.e. with weights
selected without concern for scale equivalence) with enriched fluxes ($k_{\text{flux}} = k + 1$, $k_{\text{test}} = k + 2$), as
defined in Section 5.2: comparison with LDG [4].

Quadratic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.5e-2   -       -        -       7.6e-1   -       -        -
4x4         4.4e-3   3.00    -        -       1.3e-1   2.50    -        -
8x8         5.4e-4   3.02    2.0e-4   -       1.9e-2   2.66    5.1e-4   -
16x16       6.6e-5   3.02    2.4e-5   3.06    3.6e-3   2.60    1.2e-4   2.09
32x32       8.2e-6   3.02    2.9e-6   3.05    7.1e-4   2.53    3.0e-5   2.04

Cubic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.5e-3   -       -        -       3.5e-2   -       -        -
4x4         1.6e-4   3.95    5.8e-5   -       8.6e-3   2.02    2.4e-4   -
8x8         1.0e-5   3.98    3.6e-6   4.01    1.2e-3   2.46    3.9e-5   2.62
16x16       6.1e-7   4.01    2.2e-7   4.02    1.5e-4   2.65    5.3e-6   2.75
32x32       3.7e-8   4.02    -        -       3.3e-5   2.59    -        -
C. Numerical Results: Mesh-Dependent Inner Product.

Table C.1

$L^2$ error and h-convergence rates for a mesh-dependent inner product with weights as specified in
equations (4.4)-(4.9) and discussed in Section 5.3, an inner product that represents our best compromise
between the continuity argument and concerns for scale equivalence in the determination of the optimal
test functions: comparison with LDG [4].

Quadratic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.6e-2   -       -        -       5.6e-1   -       -        -
4x4         4.7e-3   2.91    -        -       1.1e-1   2.39    -        -
8x8         6.0e-4   2.94    2.0e-4   -       2.8e-2   2.15    5.1e-4   -
16x16       7.6e-5   2.96    2.4e-5   3.06    7.3e-3   2.07    1.2e-4   2.09
32x32       9.5e-6   2.97    2.9e-6   3.05    1.8e-3   2.05    3.0e-5   2.04
64x64       1.2e-6   2.98    -        -       4.3e-4   2.04    -        -

Cubic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.7e-3   -       -        -       2.7e-1   -       -        -
4x4         1.6e-4   4.06    5.8e-5   -       2.4e-2   3.47    2.4e-4   -
8x8         9.9e-6   4.04    3.6e-6   4.01    3.1e-3   3.22    3.9e-5   2.62
16x16       6.1e-7   4.03    2.2e-7   4.02    3.9e-4   3.13    5.3e-6   2.75
32x32       3.8e-8   4.02    -        -       4.9e-5   3.08    -        -
64x64       2.4e-9   4.02    -        -       4.3e-6   3.13    -        -

Table C.2

$L^2$ error and h-convergence rates for a mesh-dependent inner product with weights as specified
in equations (4.4)-(4.9) and discussed in Section 5.3, with enriched fluxes ($k_{\text{flux}} = k + 1$, $k_{\text{test}} = k + 2$):
comparison with LDG [4].

Quadratic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.5e-2   -       -        -       7.6e-1   -       -        -
4x4         4.4e-3   3.00    -        -       1.2e-1   2.62    -        -
8x8         5.4e-4   3.01    2.0e-4   -       1.9e-2   2.64    5.1e-4   -
16x16       6.7e-5   3.01    2.4e-5   3.06    4.7e-3   2.46    1.2e-4   2.09
32x32       8.4e-6   3.01    2.9e-6   3.05    1.1e-3   2.35    3.0e-5   2.04
64x64       1.0e-6   3.01    -        -       2.8e-4   2.27    -        -

Cubic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.5e-3   -       -        -       3.5e-2   -       -        -
4x4         1.6e-4   3.99    5.8e-5   -       5.4e-3   2.70    2.4e-4   -
8x8         1.0e-5   3.99    3.6e-6   4.01    6.1e-4   2.92    3.9e-5   2.62
16x16       6.2e-7   4.00    2.2e-7   4.02    1.3e-4   2.74    5.3e-6   2.75
32x32       3.8e-8   4.00    -        -       2.5e-5   2.63    -        -
D. Numerical Results: Mesh-Dependent Inner Product, Least-Squares Compromise.

Table D.1

$L^2$ error and h-convergence rates for an inner product with weights as described in Section 5.4,
an inner product that brings concern for scale equivalence in the stiffness matrix into our compromise
between the continuity argument and concerns for scale equivalence in the determination of optimal
test functions: comparison with LDG [4].

Quadratic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.6e-2   -       -        -       5.6e-1   -       -        -
4x4         4.6e-3   2.97    -        -       1.2e-1   2.23    -        -
8x8         5.8e-4   2.97    2.0e-4   -       2.7e-2   2.19    5.1e-4   -
16x16       7.4e-5   2.97    2.4e-5   3.06    6.6e-3   2.14    1.2e-4   2.09
32x32       9.4e-6   2.97    2.9e-6   3.05    1.7e-3   2.10    3.0e-5   2.04
64x64       1.2e-6   2.98    -        -       4.1e-4   2.07    -        -

Cubic Elements

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.7e-3   -       -        -       2.7e-1   -       -        -
4x4         1.6e-4   4.10    5.8e-5   -       2.4e-2   3.47    2.4e-4   -
8x8         9.7e-6   4.05    3.6e-6   4.01    3.1e-3   3.22    3.9e-5   2.62
16x16       6.1e-7   4.03    2.2e-7   4.02    3.9e-4   3.12    5.3e-6   2.75
32x32       3.8e-8   4.02    -        -       4.9e-5   3.08    -        -
64x64       2.4e-9   4.02    -        -       2.3e-6   3.26    -        -

Table D.2

$L^2$ error and h-convergence rates for an inner product with weights as described in Section 5.4,
an inner product that brings concern for scale equivalence in the stiffness matrix into our compromise
between the continuity argument and concerns for scale equivalence in the determination of optimal
test functions, with enriched fluxes ($k_{\text{flux}} = k + 1$, $k_{\text{test}} = k + 2$): comparison with LDG [4].

Quadratic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         3.5e-2   -       -        -       7.6e-1   -       -        -
4x4         4.2e-3   3.08    -        -       4.2e-2   4.18    -        -
8x8         5.2e-4   3.04    2.0e-4   -       1.0e-2   3.11    5.1e-4   -
16x16       6.5e-5   3.02    2.4e-5   3.06    2.7e-3   2.64    1.2e-4   2.09
32x32       8.2e-6   3.02    2.9e-6   3.05    7.3e-4   2.40    3.0e-5   2.04

Cubic Elements - Enriched Fluxes

Mesh Size   DPG u    rate    LDG u    rate    DPG p    rate    LDG p    rate
2x2         2.5e-3   -       -        -       3.5e-2   -       -        -
4x4         1.5e-4   4.06    5.8e-5   -       4.5e-3   2.96    2.4e-4   -
8x8         9.5e-6   4.04    3.6e-6   4.01    2.0e-3   2.07    3.9e-5   2.62
16x16       5.9e-7   4.02    2.2e-7   4.02    3.6e-4   2.10    5.3e-6   2.75
32x32       3.7e-8   4.02    -        -       4.9e-5   2.26    -        -
AN INVESTIGATION OF BLOCK PRECONDITIONERS FOR UNSTEADY NAVIER-STOKES

EDWARD G. PHILLIPS*, ERIC C. CYR† AND JOHN N. SHADID*

Abstract. In this paper we investigate block upper triangular preconditioners for the saddle point system
generated by discretizing the unsteady Navier-Stokes equations. We focus on Schur complement approximations used
within the block structure of these preconditioners. We consider Schur complements generated by Neumann series
approximations based on various approximations of the convection-diffusion operator. We also consider
improvements upon these approximations using a sparse approximate inverse (SPAI) algorithm and a structured probing
algorithm. The preconditioners are compared based on numerical results.


1. Introduction. In this paper, we consider the numerical solution of the unsteady
Navier-Stokes problem for the flow of viscous Newtonian fluids: Given an open bounded
domain $\Omega \subset \mathbb{R}^d$ with boundary $\partial\Omega$, time interval $[0, T]$, and data $f$, find a velocity field
$u = u(x, t)$ and a pressure field $p = p(x, t)$ satisfying

$$\frac{\partial u}{\partial t} - \nu \Delta u + (u \cdot \nabla) u + \nabla p = f \quad \text{on } \Omega \times [0, T], \qquad (1.1)$$

$$\nabla \cdot u = 0 \quad \text{on } \Omega \times [0, T], \qquad (1.2)$$

subject to initial and boundary conditions, where $\nu$ is the kinematic viscosity, $\Delta$ is the
Laplacian, and $\nabla$ is the gradient.

Time discretization is applied to this system along with a Newton or Picard linearization.
Then spatial discretization using finite differences or finite elements results in large, sparse
saddle-point systems of the form

$$\begin{pmatrix} F & B^T \\ B & -C \end{pmatrix} \begin{pmatrix} u \\ p \end{pmatrix} = \begin{pmatrix} f \\ g \end{pmatrix} \qquad (1.3)$$

or

$$\mathcal{A} x = b, \qquad (1.4)$$

where $u$ and $p$ are the discrete velocity and pressure, $F$ is the discrete transient
convection-diffusion operator for velocity, $B^T$ is the discrete gradient, $B$ is the discrete divergence, $C$ is a
stabilization matrix, and $f$ and $g$ account for forcing and boundary conditions. If the
discretization is LBB stable (see [5] Chapter 5), then no stabilization is required, and $C = 0$. For
a detailed description of the linearization and discretization of (1.1) and (1.2) see [5] Chapter
7.

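In order to efficiently solve equation (1.3), a preconditioned Krylov method is often used.
Many preconditioners employ the block LU factorization

$$\mathcal{A} = \begin{pmatrix} I & 0 \\ B F^{-1} & I \end{pmatrix} \begin{pmatrix} F & B^T \\ 0 & -S \end{pmatrix}, \qquad (1.5)$$

where

$$S = C + B F^{-1} B^T \qquad (1.6)$$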


*  Department  of  Applied  Mathematics  and  Scientific  Computation,  University  of  Maryland,  eg- 
Block Preconditioners for Unsteady Navier-Stokes


is the pressure Schur complement. For fast convergence of a Krylov method right-preconditioned
by $P$, the operator $\mathcal{A} P^{-1}$ should have few distinct eigenvalues. As noted in [9], if we let $P$
be the upper block triangular factor, then $\mathcal{A} P^{-1}$ is the lower block triangular factor and has
a single eigenvalue of 1. Building on this idea, we let $P$ be an approximation to this block
factor

$$P = \begin{pmatrix} \hat{F} & B^T \\ 0 & -\hat{S} \end{pmatrix}, \qquad (1.7)$$

where $\hat{F}$ and $\hat{S}$ are approximations for $F$ and $S$ respectively (see [4] for a more general
discussion of approximate block factorization preconditioners). To apply $P^{-1}$, only the action of
the inverses of $\hat{F}$ and $\hat{S}$ is required. $F^{-1}$ is often well approximated by a multigrid
preconditioner as argued in [5] Chapter 8 and [14], so we set $\hat{F} = F$ and focus on approximations for
the Schur complement.
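The factorization (1.5) and the Schur complement (1.6) can be checked on a tiny example. The sketch below is illustrative only: it assumes a diagonal $F$ so that $F^{-1}$ is trivial to apply, it stores matrices as dense lists of rows, and the names `matmul`, `identity`, and `block_lu` are hypothetical helpers rather than part of any code used in the paper.

```python
def matmul(X, Y):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def identity(k):
    return [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]


def block_lu(F_diag, B, C):
    """Block LU factors of A = [[F, B^T], [B, -C]] for a diagonal F.

    F_diag: diagonal entries of F (length n); the diagonal structure is an
            assumption made here so that F^{-1} is trivial.
    B:      m x n matrix, C: m x m matrix (lists of rows).
    Returns the dense matrices (A, L, U, S).
    """
    n, m = len(F_diag), len(B)
    BFinv = [[B[i][j] / F_diag[j] for j in range(n)] for i in range(m)]
    Bt = [list(col) for col in zip(*B)]                      # n x m
    BFinvBt = matmul(BFinv, Bt)                              # m x m
    # Schur complement S = C + B F^{-1} B^T, as in (1.6).
    S = [[C[i][j] + BFinvBt[i][j] for j in range(m)] for i in range(m)]
    F = [[F_diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    In, Im = identity(n), identity(m)
    A = [F[i] + Bt[i] for i in range(n)] + \
        [B[i] + [-c for c in C[i]] for i in range(m)]
    L = [In[i] + [0.0] * m for i in range(n)] + \
        [BFinv[i] + Im[i] for i in range(m)]
    U = [F[i] + Bt[i] for i in range(n)] + \
        [[0.0] * n + [-s for s in S[i]] for i in range(m)]
    return A, L, U, S


A, L, U, S = block_lu([2.0, 4.0], [[1.0, 1.0]], [[0.5]])
LU = matmul(L, U)  # should reproduce A entry by entry
```

Because $\mathcal{A} = LU$, choosing $P = U$ gives $\mathcal{A} P^{-1} = L$, which is unit lower triangular; its eigenvalues are its diagonal entries, all equal to 1.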

The aim of this paper is to investigate the quality of various Schur complement
approximations. The immediate goal is not an efficient block preconditioner but an increased
understanding of how Schur complement approximations impact preconditioning. We consider
how each Schur complement approximation affects the convergence of a Krylov method
applied to the exact Schur complement and how this compares to the effect when the
approximation is used in the block preconditioner (1.7) for the saddle point system (1.3). For Schur
complement approximations which are based on approximations of $F^{-1}$, we also want to compare
the effect of the $F^{-1}$ approximation as a preconditioner for $F$. We are particularly interested
in how the performance of these preconditioners scales with CFL and Reynolds number, as
finding good, inexpensive approximations to $S$ is difficult for large CFL or Reynolds number
[1].

The remainder of this paper is structured as follows. In Section 2 we introduce
several Schur complement approximations based on Neumann series approximations of $F^{-1}$.
In Sections 3-4 we consider two ways to improve upon these approximations: the sparse
approximate inverse algorithm and structured probing. Section 5 describes another approximate
block factorization preconditioner, the least-squares commutator, for comparison. Section 6
contains some computational results for assessing the various preconditioners. Finally,
conclusions are drawn in Section 7.

2. Neumann series approximations. In considering approximations for the Schur
complement, we first investigate those produced by approximating $F^{-1}$ by some $\hat{F}^{-1}$. The Schur
complement is then approximated through the explicit product

$$\hat{S} = C + B \hat{F}^{-1} B^T. \qquad (2.1)$$

The Neumann series is a simple polynomial approximation for the inverse of a matrix (see
[10] Section 12.3.1 for a brief discussion). Given a nonsingular matrix $F$ and a preconditioner
$M$, if $\rho(I - F M^{-1}) < 1$, the following expansion holds:

$$F^{-1} = M^{-1} \sum_{i=0}^{\infty} (I - F M^{-1})^i. \qquad (2.2)$$

Truncating this series gives the approximation

$$\hat{F}^{-1} = M^{-1} \sum_{i=0}^{K-1} (I - F M^{-1})^i, \qquad (2.3)$$

the $K$ term Neumann series preconditioned by $M$. Since we require $\rho(I - F M^{-1}) < 1$, a good
choice for $M$ is an easily invertible approximation for $F$. Note that the 1 term Neumann series
is equal to the preconditioner $M^{-1}$ itself.
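The truncated series (2.3) can be sketched in a few lines. The code below is a minimal dense illustration under stated assumptions: matrices are small lists of rows, $M^{-1}$ is passed in explicitly, and the names `matmul` and `neumann_inverse` as well as the 2 x 2 example matrix are hypothetical. A practical implementation would work with sparse matrices.

```python
def matmul(X, Y):
    """Dense matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def neumann_inverse(F, M_inv, K):
    """K term Neumann series approximation of F^{-1} preconditioned by M.

    Returns M^{-1} * sum_{i=0}^{K-1} (I - F M^{-1})^i as a dense matrix.
    """
    n = len(F)
    FM = matmul(F, M_inv)
    # E = I - F M^{-1}; the series converges when its spectral radius is < 1.
    E = [[(1.0 if i == j else 0.0) - FM[i][j] for j in range(n)]
         for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [[0.0] * n for _ in range(n)]
    for _ in range(K):
        total = [[total[i][j] + term[i][j] for j in range(n)]
                 for i in range(n)]
        term = matmul(term, E)  # next power of E
    return matmul(M_inv, total)


# Hypothetical example: F is diagonally dominant and M is its diagonal.
F = [[4.0, 1.0], [1.0, 3.0]]
M_inv = [[1.0 / 4.0, 0.0], [0.0, 1.0 / 3.0]]
one_term = neumann_inverse(F, M_inv, 1)     # reproduces M^{-1}
many_terms = neumann_inverse(F, M_inv, 40)  # close to the exact inverse
residual = matmul(F, many_terms)            # close to the identity
```

With one term the function returns $M^{-1}$, in line with the remark above, and each additional term adds one more power of $I - F M^{-1}$, which is why the density of the approximation grows with $K$.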

The choice of $M$ is then very important. A good choice of $M$ will lead the Neumann
series to converge and will produce a good approximation for the Schur complement. An
important issue for efficient computing is the sparsity of $\hat{S}$. Since we do not alter $\hat{S}$ beyond
the approximation of $F^{-1}$, it is important that $\hat{F}^{-1}$ is constructed to be sparse. Consequently
$M$ should be chosen to be sparse, and since the density of $\hat{F}^{-1}$ increases as the number of
Neumann terms increases, $K$ should be kept small. For $M$ we consider the following sparse
approximations to $F$ or $F^{-1}$.

SIMPLE
    SIMPLE approximates $F$ by its diagonal.

SIMPLEC
    SIMPLEC approximates $F$ by a diagonal matrix, where the $i$th diagonal entry $M_{ii}$ is
    the sum of the absolute values of the row $F_{i,*}$. This expands on the idea of SIMPLE
    by attempting to incorporate more data from $F$ into the diagonal.

Block(k)
    Given a whole number $k$, the block Jacobi approximation preserves the block
    diagonal of $F$ composed of $k \times k$ blocks. This expands on SIMPLE by preserving data
    off of the diagonal. Naturally, $k$ should be kept modest so that $M$ is sparse. We have
    considered $k = 2$, $3$, and $4$; $k = 1$ gives SIMPLE. We consider systems where
    the $x$ and $y$ velocities are split so $u^T = (u_x^T, u_y^T)$. In this case, the blocking accounts
    for the influence of closely indexed spatial nodes on each other. The effect of this
    blocking approximation will then be largely contingent on index ordering and flow
    direction.

ILU(0)
    ILU(0) is the incomplete LU factorization with no fill-in. It computes the LU
    decomposition of $F$, only filling in according to the sparsity pattern of $F$. $M^{-1}$ is then
    given by $U^{-1} L^{-1}$. See [10] Section 10.3.2 for an algorithmic description of ILU(0).
    This approximation is used mostly as a benchmark since it produces denser
    approximations than desired.

ILUT($\tau$)
    ILUT is the incomplete LU factorization with threshold. Given a threshold $\tau$, ILUT
    computes the LU decomposition, but does not fill in for any element of $F$ whose
    magnitude is less than $\tau$ times the norm of its row. See [10] Section 10.4.1 for an
    algorithmic description.

SPAI
    SPAI is a sparse approximate inverse. $M^{-1}$ is computed to minimize the norm
    $\|F (M^{-1})_{*,i} - e_i\|_2$ within a tolerance for each column of $M^{-1}$, subject to a constraint
    on the number of nonzeros allowed per column. SPAI begins minimization for each
    column in the direction of the corresponding identity column. Nonzeros are
    iteratively added to the search direction based on the residual until either the tolerance or
    the maximum number of nonzeros is reached. For precise algorithmic details, SPAI
    is described as the Approximate Inverse Algorithm in [2]. We used the MATLAB
    SPAI implementation SPAI2 [7].

We will refer to a $K$ term Neumann series preconditioned by any of these choices of
$M$ as (preconditioner_name)_K (e.g. a 3 term Neumann series with a 2 x 2 block Jacobi
preconditioner will be referred to as Block(2)_3).
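The three simplest choices of $M$ listed above (SIMPLE, SIMPLEC, and Block(k)) can be sketched as follows. This is an illustration on dense lists of rows, not the implementation used for the experiments: the function names are hypothetical, the blocking is applied to the matrix as a whole and ignores the splitting into $x$ and $y$ velocity components, and ILU(0), ILUT, and SPAI are not covered.

```python
def simple(F):
    """SIMPLE: keep only the diagonal of F."""
    n = len(F)
    return [[F[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]


def simplec(F):
    """SIMPLEC: diagonal matrix whose i-th entry is the absolute row sum of F."""
    n = len(F)
    return [[sum(abs(v) for v in F[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def block_jacobi(F, k):
    """Block(k): keep the k x k diagonal blocks of F and zero the rest."""
    n = len(F)
    return [[F[i][j] if i // k == j // k else 0.0 for j in range(n)]
            for i in range(n)]


# Hypothetical 4 x 4 example matrix.
F = [[4.0, -1.0, 0.0, -1.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [-1.0, 0.0, -1.0, 4.0]]
M_simple = simple(F)
M_simplec = simplec(F)
M_block2 = block_jacobi(F, 2)
```

In this sketch `block_jacobi(F, 1)` coincides with `simple(F)`, matching the remark that $k = 1$ gives SIMPLE.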




3. Other uses for SPAI. In addition to approximating $F^{-1}$, SPAI can be used in other
ways to construct an approximate Schur complement. First, it can be used to approximate
the product $F^{-1} B^T$ by a matrix $X$. This can be accomplished by minimizing $\|F X_{*,i} - (B^T)_{*,i}\|_2$ for
each column of $B^T$ with the SPAI algorithm. Once $X$ is computed, $S$ can be approximated by
$\hat{S} = BX$. Because $X$ has fewer columns than $M$, this requires less work than the application of
SPAI defined above. This second use of SPAI will be referred to as SPAIb. Note that SPAIb
cannot act as a preconditioner for a Neumann series approximating $F^{-1}$.

Another use of SPAI is to apply it to the entire Schur complement approximation. Note
that in practice we do not explicitly require the matrix $\hat{S}$ for preconditioning. We merely need
the action of $\hat{S}^{-1}$. If we approximate $S$ and end up with a matrix which is denser than we
would like, we can obtain a sparse approximation of $\hat{S}^{-1}$ by applying SPAI.

4. Structured probing. Another method which can be used to produce a sparse
approximation of the Schur complement is structured probing, as defined in [13]. Structured
probing approximates a matrix $A \in \mathbb{R}^{n \times n}$ according to a chosen sparsity pattern represented
in a matrix $H \in \{0, 1\}^{n \times n}$. A matrix of $p$ probing vectors $X \in \{0, 1\}^{n \times p}$ is then constructed
via graph coloring techniques such that if $A$ were already of the desired sparsity pattern, the
entries of $A$ would be preserved in the matrix $AX$. The entries of $AX$ are then mapped into an
approximation of $A$ according to the sparsity pattern of $H$. We use the MATLAB version of
the Structured Probing Toolkit [12] to implement probing.
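The probing procedure described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the greedy coloring and the function names `greedy_column_coloring` and `probe` are hypothetical choices for this sketch and are not taken from the Structured Probing Toolkit [12], which may color the graph differently. Columns of the pattern that share a nonzero row receive different colors, each color defines one 0/1 probing vector, and the products are mapped back onto the pattern.

```python
def greedy_column_coloring(H):
    """Greedily color the columns of a 0/1 pattern H so that any two columns
    with a nonzero in the same row receive different colors."""
    n = len(H)
    colors = []
    for j in range(n):
        # Colors already used by earlier columns that overlap column j.
        taken = {colors[k] for k in range(j)
                 if any(H[i][j] and H[i][k] for i in range(n))}
        c = 0
        while c in taken:
            c += 1
        colors.append(c)
    return colors


def probe(matvec, H):
    """Approximate a matrix on the pattern H from p matrix-vector products."""
    n = len(H)
    colors = greedy_column_coloring(H)
    p = max(colors) + 1
    approx = [[0.0] * n for _ in range(n)]
    for c in range(p):
        # Probing vector for color c: ones in the columns carrying that color.
        x = [1.0 if colors[j] == c else 0.0 for j in range(n)]
        y = matvec(x)  # column c of the product AX
        for j in range(n):
            if colors[j] != c:
                continue
            for i in range(n):
                if H[i][j]:
                    approx[i][j] = y[i]
    return approx, p


# Toy check: a tridiagonal matrix probed with its own tridiagonal pattern.
n = 6
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
H = [[1 if abs(i - j) <= 1 else 0 for j in range(n)] for i in range(n)]
approx, p = probe(lambda x: [sum(A[i][k] * x[k] for k in range(n))
                             for i in range(n)], H)
print(p, approx == A)
```

In this toy case the matrix already has the requested pattern, so three probing vectors recover it exactly. When the matrix is denser than the pattern, the entries outside the pattern are folded into the retained ones, which is the source of the approximation error.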

We regard the sparsity of a 1-term Neumann series with a diagonal preconditioner as a
benchmark. In this case, if $C = 0$, then the approximation $\hat{S}$ has the same sparsity pattern
as $BB^T$. To measure the sparsity of a Schur complement approximation $\hat{S}$, we use the ratio of
nonzeros in $\hat{S}$ to nonzeros in $BB^T$. If this ratio is much greater than 1 for a good $\hat{S}$, then
we can attempt to preserve the convergence properties of $\hat{S}$ and reduce its density by probing
it to the sparsity structure of $BB^T$. When probing is applied to the Schur complement
approximation in this way, we refer to the resulting method as (original_method_name)_P (e.g., a
3-term Neumann series with a 2 × 2 block Jacobi preconditioner which has been probed will be
referred to as Block(2)_3_P).

5. The least-squares commutator. The least-squares commutator (LSC) preconditioner,
closely related to the BFBt preconditioner [6], is a popular preconditioner developed for
Navier-Stokes systems. It is fast and does not require any data beyond what is needed to
construct the problem. Consequently, it makes a good benchmark with which to compare the
preconditioners studied here. The idea of LSC, as described in [5] and [3], is to commute
the discrete velocity convection-diffusion operator $F$ with the discrete gradient operator $B^T$
so that

    $B F^{-1} B^T \approx B M_v^{-1} B^T F_p^{-1} M_p$    (5.1)

where $M_v$ is the velocity mass matrix, $M_p$ is the pressure mass matrix, and $F_p$ is the discrete
pressure convection-diffusion operator. So that $F_p$ need not be constructed explicitly, it is
obtained by solving the least squares problem

    $\min \| [M_v^{-1} F M_v^{-1} B^T]_j - M_v^{-1} B^T M_p^{-1} [F_p]_j \|_{M_v}$    (5.2)

for each column $j$ of $F_p$. This defines

    $F_p = M_p (B M_v^{-1} B^T)^{-1} (B M_v^{-1} F M_v^{-1} B^T)$    (5.3)

so we let

    $\hat{S} = B M_v^{-1} B^T (B M_v^{-1} F M_v^{-1} B^T)^{-1} (B M_v^{-1} B^T)$.    (5.4)

This approximation requires only the velocity mass matrix, which is required to build $F$.
$M_v^{-1}$ is commonly replaced by the inverse of the diagonal of $M_v$. Then the $B M_v^{-1} B^T$
terms are scaled discrete Laplacian operators for which there exist efficient solvers. The
action of the inverse of the Schur complement approximation is then easy to compute.
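As a consistency check between the equations of this section, substituting (5.3) into the right-hand side of (5.1) recovers the approximation (5.4). The short derivation below assumes that all of the indicated inverses exist.

```latex
\begin{align*}
B M_v^{-1} B^T F_p^{-1} M_p
  &= B M_v^{-1} B^T
     \left[ M_p (B M_v^{-1} B^T)^{-1} (B M_v^{-1} F M_v^{-1} B^T) \right]^{-1} M_p \\
  &= (B M_v^{-1} B^T) (B M_v^{-1} F M_v^{-1} B^T)^{-1} (B M_v^{-1} B^T) M_p^{-1} M_p \\
  &= (B M_v^{-1} B^T) (B M_v^{-1} F M_v^{-1} B^T)^{-1} (B M_v^{-1} B^T).
\end{align*}
```

The pressure mass matrix cancels, which is why only the velocity mass matrix appears in the final approximation.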

6. Computational results. The test problem considered is a regularized lid-driven cavity
on a 32 × 32 grid, using $Q_2$-$P_1$ stable elements and a Picard linearization. The problem
is generated with IFISS [15] using its standard time-stepping procedure (TR-AB2, described
in [8]) with $\nu = 1/100$ to a time of 0.4. The time step is then adjusted to produce a given
CFL. The domain is $[-1, 1] \times [-1, 1]$, so $Re = 2/\nu$. The next matrix system is used to test the
preconditioners. The Krylov method used is GMRES [11] with a stopping tolerance of $10^{-9}$
and zero initial guess. This very tight criterion may favor the more robust preconditioners. We
assess the effect of the preconditioners on the entire saddle point system, $\mathcal{A}$, by solving (1.3)
and the effect on $S$ and $F$ by solving the explicitly generated equations

    $-Sp = g - BF^{-1}f$,    (6.1)

and

    $Fu = f - B^T p$    (6.2)

respectively, as these equations produce the same solution as (1.3). For (6.1) we use the
approximation $\hat{S}$ as preconditioner and for (6.2) we use the approximation $\hat{F}$. We will use
GMRES iteration counts and total computation time as measures of performance. We note that the
MATLAB codes used may not be highly optimized, but computation time can give an indication of
the relative performance of the preconditioners studied. The computation times for probing and
SPAI may be particularly inflated by their MATLAB implementations.

We begin by considering the performance of Schur complement approximations based on
single-term Neumann series approximations for $F^{-1}$: SIMPLE_1, SIMPLEC_1, Block(2)_1,
ILU(0)_1, ILUT(0.25)_1, and SPAI_1. We use Block(2) because Block(3) and Block(4) give
very similar results with slightly higher density. We found that the choice of $\tau = 0.25$ for
ILUT gives a good balance between density and efficiency. We also compare SPAIb and LSC,
for which Neumann series on $F$ do not apply. For both SPAI and SPAIb we use the default
stopping tolerance of 0.4 and a maximum of 50 non-zeros per column. Initially our results
will be based on setting Re = 200. Iteration counts and computation time are compared
as functions of CFL for $\mathcal{A}$ in Figure 6.1, for $S$ in Figure 6.2, and for $F$ in Figure 6.3. The
first thing to notice is that iteration count for every preconditioner has the same relationship to
CFL: it is moderate and approximately constant for CFL up to about 5, then begins to increase
for CFL greater than 5. Because of this dependence on CFL, performance of preconditioners
will be judged largely on the right-hand tail in these figures.

In general, the growth rate of iteration count is very similar between $\mathcal{A}$, $S$, and $F$, but we
see an interesting effect with Block(2)_1. Notice that for $F$, Block(2)_1 performs worse than
SIMPLEC_1 at the largest CFL, with iteration count as large as that of SIMPLE_1. But for
$S$ and $\mathcal{A}$, Block(2)_1 has a lower iteration count than SIMPLEC_1 at this CFL. We can also
see that the difference in iteration count between SIMPLE_1 and SIMPLEC_1 is dramatically
greater for $S$ and $\mathcal{A}$ than for $F$. These observations show that the performance of a Schur
complement approximation using an approximation $\hat{F}^{-1}$ is not entirely contingent on the
performance of $\hat{F}^{-1}$ on $F$. On the other hand, it seems that the performance of a block
preconditioner using Schur complement approximation $\hat{S}$ relies heavily on the performance of
$\hat{S}$ preconditioning $S$. Similar trends are seen in computation time. Note that computation time
is large for SPAI_1 because of the extra work involved during minimization. An investigation of
the large computation time for SPAIb at low CFL is deferred until density information in Table
6.2 is presented.

Fig. 6.1. Iteration count and computation time for various single Neumann term preconditioners applied to $\mathcal{A}$.

Fig. 6.2. Iteration count and computation time for various single Neumann term preconditioners applied to $S$.


We now turn to higher order Neumann series approximations for $F^{-1}$. To avoid excessive
density in the Schur complement approximation, we add more Neumann terms only to
the sparsest preconditioners: SIMPLE, SIMPLEC, and Block(2). We observed that iteration
count improved for each preconditioner when a second Neumann term was added, but
SIMPLEC was the only preconditioner for which three or more Neumann terms improved
iteration count for all CFL. This effect is well explained by considering the spectral radius
$\rho(I - FM^{-1})$, as shown in Table 6.1. For SIMPLE and Block(2), the spectral radius is much
greater than 1 for large CFL. As a result, iteration count degrades for large CFL when Neumann
terms are added. The spectral radius for SIMPLEC is close to 1 for all CFL considered.
Although it slightly exceeds 1 for larger CFL, this is good enough to see iteration count
improvement with up to 6 Neumann terms for $F$ and up to 5 for $S$ and $\mathcal{A}$, as shown in Figures
6.4 - 6.6. The slope of iteration count decreases for each Neumann term added in each of the
3 systems. The same sort of decrease is apparent in computation time, with the exception of
the third Neumann term, until SIMPLEC_5 is comparable to ILU(0)_1.
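The role of the spectral radius can be illustrated on a toy problem. The sketch below assumes the K-term Neumann approximation has the form $M^{-1} \sum_{k=0}^{K-1} (I - FM^{-1})^k$, which is the form consistent with the quantity $\rho(I - FM^{-1})$ reported in Table 6.1. The 2 × 2 matrices, the diagonal choice of $M$, and all function names are hypothetical and are not taken from the experiments. With this form, the residual $I - F \hat{F}^{-1}$ equals $(I - FM^{-1})^K$, so added terms help only when the powers of that matrix shrink.

```python
def matmul(A, B):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def neumann_inverse(F, Minv, K):
    """K-term Neumann approximation M^{-1} * sum_{k=0}^{K-1} (I - F M^{-1})^k."""
    n = len(F)
    eye = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    FMinv = matmul(F, Minv)
    N = [[eye[i][j] - FMinv[i][j] for j in range(n)] for i in range(n)]
    term, total = eye, [row[:] for row in eye]
    for _ in range(K - 1):
        term = matmul(term, N)  # next power of N
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return matmul(Minv, total)


def residual(F, approx_inv):
    """Largest entry (in magnitude) of I - F * approx_inv, which equals N^K."""
    n = len(F)
    R = matmul(F, approx_inv)
    return max(abs((1.0 if i == j else 0.0) - R[i][j])
               for i in range(n) for j in range(n))


def diagonal_inverse(F):
    """Inverse of diag(F), used here as a simple stand-in for M^{-1}."""
    n = len(F)
    return [[1.0 / F[i][i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


F_small_rho = [[4.0, 1.0], [1.0, 3.0]]  # rho(I - F M^{-1}) = sqrt(1/12) < 1
F_large_rho = [[1.0, 3.0], [2.0, 1.0]]  # rho(I - F M^{-1}) = sqrt(6) > 1
for name, F in (("rho < 1", F_small_rho), ("rho > 1", F_large_rho)):
    Minv = diagonal_inverse(F)
    print(name, [round(residual(F, neumann_inverse(F, Minv, K)), 4)
                 for K in range(1, 6)])
```

For the first matrix the residual shrinks with every added term, while for the second it grows, mirroring the degradation reported for SIMPLE and Block(2) at large CFL.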


CFL         0.01  0.05   0.1   0.5     1     5    10    50   100   500  1000
SIMPLE       0.7   0.8   0.7   0.6   0.5   0.7   1.0   3.0   5.5  25.3  50.2
SIMPLEC      0.9   0.9   0.9   0.8   0.7   0.7   0.9   1.0   1.1   1.3   1.3
Block(2)     0.7   0.7   0.7   0.5   0.4   0.5   0.7   1.2   2.1   3.4   6.6

Table 6.1. Values of $\rho(I - FM^{-1})$ for 3 choices of $M$.

Fig. 6.3. Iteration count and computation time for various single Neumann term preconditioners applied to $F$.

Fig. 6.4. Iteration count and computation time for Neumann series preconditioned by SIMPLEC applied to $\mathcal{A}$.

Fig. 6.5. Iteration count and computation time for Neumann series preconditioned by SIMPLEC applied to $S$.


Although adding Neumann terms improves iteration count and computation time, it also
increases the density of the approximation $\hat{S}$. This increase in density can be seen in Table
6.2. The ratio of non-zeros in $\hat{S}$ to non-zeros in $BB^T$ is constant over CFL for the Neumann
series preconditioned by SIMPLEC, increasing with each term. The ratio is also constant over
CFL for ILU(0)_1, but is much greater. Having comparable computation time and lower density
makes SIMPLEC_5 more favorable than ILU(0)_1 for larger problems. ILUT(0.25)_1 and SPAI_1
grow in density as CFL increases, while SPAIb, in contrast, decreases in density for larger CFL.
This explains the large computation times seen with SPAIb for low CFL in Figures 6.1 and 6.2.
In this case, the SPAI algorithm ran longer for low CFL, adding more non-zeros, without
reaching its tolerance.

Fig. 6.6. Iteration count and computation time for Neumann series preconditioned by SIMPLEC applied to $F$.


CFL              0.01   0.05    0.1    0.5      1      5     10     50    100    500   1000
SIMPLEC_1         1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0
SIMPLEC_2         2.6    2.6    2.6    2.6    2.6    2.6    2.6    2.6    2.6    2.6    2.6
SIMPLEC_3         4.8    4.8    4.8    4.8    4.8    4.8    4.8    4.8    4.8    4.8    4.8
SIMPLEC_4         7.4    7.4    7.4    7.4    7.4    7.4    7.4    7.4    7.4    7.4    7.4
SIMPLEC_5        10.3   10.3   10.3   10.3   10.3   10.3   10.3   10.3   10.3   10.3   10.3
SIMPLEC_6        13.3   13.3   13.3   13.3   13.3   13.3   13.3   13.3   13.3   13.3   13.3
ILU(0)_1         31.6   31.6   31.6   31.6   31.6   31.6   31.6   31.6   31.6   31.6   31.6
ILUT(0.25)_1      1.0    1.0    1.0    1.0    1.0    1.0    1.1    3.5    7.2   21.1   24.9
SPAI_1            1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.4    1.6    2.0    2.6
SPAIb             2.5    2.5    2.5    1.3    1.0    1.0    1.0    1.0    1.0    1.0    1.0

Table 6.2. Ratio of non-zeros in $\hat{S}$ to non-zeros in $BB^T$.


To assess the performance of probing, we apply it to the Schur complement approximations
defined by ILU(0)_1 and the SIMPLEC Neumann series with up to 5 terms. For
this problem, using the sparsity pattern of $BB^T$, 27 probing vectors are required. Iteration
counts and computation time are compared with the original non-probed preconditioners in
Figures 6.7 and 6.8. For the SIMPLEC preconditioners, probing only slightly increases
iteration count. Computation time is much greater for probing in this size problem due to the
expensive graph coloring process, but grows more slowly with CFL. ILU(0)_1 sees a much greater
deterioration in iteration count when probing is applied. This can be attributed to the higher
density of ILU(0)_1 as compared to the SIMPLEC preconditioners. More information is lost
from ILU(0)_1 when probed to the sparsity pattern of $BB^T$.

Given what we have seen, the preconditioners of most interest are ILU(0)_1, ILUT(0.25)_1,
SPAIb, SIMPLEC_5, and SIMPLEC_5_P. Although dense, ILU(0)_1 is consistently the best
in terms of both iteration count and computation time. ILUT(0.25)_1 grows in density as
CFL increases but is competitive in iteration count and computation time. SPAIb may have
larger iteration counts than SPAI, but the decrease in its density with growing CFL and the
need for fewer minimization procedures make it a much faster preconditioner for large CFL.
SIMPLEC_5 is the best of the higher order Neumann series preconditioners considered, and
SIMPLEC_5_P performs almost as well in iteration count while being 10.3 times sparser.

Fig. 6.7. Iteration count and computation time for probed preconditioners applied to $\mathcal{A}$.

Fig. 6.8. Iteration count and computation time for probed preconditioners applied to $S$.


We assess these preconditioners further by showing how their performance scales with
Reynolds number and mesh size. The Reynolds number is modified by changing the value
of $\nu$. The scaling of iteration count with Reynolds number is plotted in Figure 6.9. Observe
that ILU(0)_1 and ILUT(0.25)_1 degrade with an increase in Reynolds number, as they do
not converge for very high Reynolds number when coupled with large CFL. The other 3
preconditioners scale similarly with Reynolds number and are comparable to LSC for Re =
20000.

We compared the performance of the 5 best preconditioners on 2 finer meshes: 64 × 64
and 128 × 128. Iteration counts and computation time for these preconditioners on the three
meshes with Re = 200 are plotted in Figures 6.10 - 6.12. The detriment of ILU(0)_1's
density is clear on the 128 × 128 grid, as MATLAB, running with 1 GB of RAM, ran out of
memory. ILUT(0.25)_1 was also too dense to run at CFL = 1000. Keep in mind that the
Schur complement approximation is explicitly computed. In practice, matrix-free multi-grid
methods could be used to avoid constructing it on the finest grid. Otherwise, we do not see
much change in iteration count. As for computation time, it appears that the probing
graph coloring algorithm as implemented gets more expensive with mesh refinement, making
SIMPLEC_5_P very slow on the 128 × 128 mesh. Ultimately, SPAIb fares the best with mesh
refinement, obtaining the lowest computation times for large CFL on the finest mesh, owing
to its sparsity and accurate approximation of $S$.

Fig. 6.9. Iteration count for the best preconditioners as it scales with Reynolds number on $\mathcal{A}$. Note
that ILU(0)_1 did not converge for Re = 20000, CFL = 1000. ILUT(0.25)_1 did not converge for Re =
2000, 20000, CFL = 500, 1000.


Fig. 6.10. Iteration count and computation time for the best preconditioners on a 32 × 32 grid for $\mathcal{A}$.

Fig. 6.11. Iteration count and computation time for the best preconditioners on a 64 × 64 grid for $\mathcal{A}$. Note that
ILUT(0.25)_1 did not converge for CFL = 1000 and SPAIb did not converge for CFL = 500, 1000.

Fig. 6.12. Iteration count and computation time for the best preconditioners on a 128 × 128 grid for $\mathcal{A}$. ILU(0)_1
consumed too much memory to be run. ILUT(0.25)_1 also ran out of memory for CFL = 1000.

7. Conclusion. This paper presented a number of preconditioners for saddle point problems
arising from discretizations of the unsteady Navier-Stokes equations. The approach is
based on an approximate block triangular factorization, focusing on approximating the Schur
complement. We considered Schur complement approximations based on Neumann series
approximations of $F^{-1}$. These rely heavily upon the spectral radius $\rho(I - FM^{-1})$. But the
efficiency of an approximation $\hat{F}^{-1}$ as a preconditioner for $F$ is not directly analogous to the
efficiency of a block preconditioner using $\hat{F}^{-1}$ to approximate $S$. Structured probing can be
employed to reduce the density of a Schur complement approximation without increasing
iteration count much, but the graph coloring procedure involved proves very expensive. A
competitive Schur complement approximation for larger problems is developed by adapting
a sparse approximate inverse algorithm to approximate $F^{-1}B^T$. It will be valuable to study
the robustness of the preconditioners presented here by applying them to other domains and
using stabilized finite elements. It is also important to note that in practice a direct solver will
not be used to invert the Schur complement approximation. Keeping this in mind, it would
be of interest to study how these Schur complement approximations interact with multi-grid
methods.

EFFICIENTLY COMPUTING TENSOR EIGENVALUES ON A GPU

GREY BALLARD*, TAMARA KOLDA†, AND TODD PLANTENGA*

Abstract. The tensor eigenproblem has many important applications, and both mathematical and
application-specific communities have taken recent interest in the properties of tensor eigenpairs as well as methods for
computing them. In particular, Kolda and Mayo [3] present a generalization of the matrix power method for symmetric
tensors. We focus in this work on efficient implementation of their algorithm, known as the shifted symmetric
higher-order power method, and on how a GPU can be used to accelerate the computation up to 70x over a sequential
implementation for an application involving many small tensor eigenproblems.

1. Introduction. The tensor eigenproblem has many important applications, and both
mathematical and application-specific communities have taken recent interest in the properties
of tensor eigenpairs as well as methods for computing them. In particular, Kolda and
Mayo [3] present a generalization of the matrix power method for symmetric tensors. We
focus in this work on efficient implementation of their algorithm, known as the shifted
symmetric higher-order power method (SS-HOPM).

The main motivating application for this work involves detection of nerve fibers in the
brain from diffusion-weighted magnetic resonance imaging data. In this application, data
is gathered for millions of cubic millimeter sized voxels. Determining the number and
directions of nerve fiber bundles within each voxel requires solving a small tensor eigenvalue
problem. Because each voxel can be resolved independently, the computations are amenable
to parallelism, and we focused our implementation on a graphics processing unit (GPU) using
the Compute Unified Device Architecture (CUDA) programming framework.

We review the definition of the tensor eigenproblem as well as the SS-HOPM algorithm
from [3] in Section 2. All of the tensors discussed here are symmetric, and exploiting symmetry
is the foremost sequential optimization we use to gain performance. Symmetric matrices
can be stored in half the space and symmetric matrix computations often require only half
the flops of their nonsymmetric counterparts; exploiting symmetry in tensors can save storage
and computation by much larger factors. In Section 3 we discuss a symmetric tensor
storage format and how this compressed format is used in the main computational kernels of
SS-HOPM.

Instead of attempting to write an algorithm that offers high parallel performance for
computing eigenpairs of tensors of general order and dimension, we focus the GPU implementation
on small tensors, as in our motivating application. Because of the inherent parallelism
in the problem, we can run many independent threads concurrently on the hardware, and we
facilitate efficiency of each thread with careful memory management. We offer an overview
of GPU computing in Section 4, describe the motivating application in Section 5, and give
the details and results of our implementation in Section 6.

The main contributions of this work are (1) the introduction of a symmetric storage format
and means of exploiting symmetry to avoid redundant computation, and (2) a parallel
implementation of SS-HOPM. While the implementation is tailored to a specific application,
we believe it will be widely applicable to high performance computations with symmetric
tensors.

2. Symmetric Tensors and Tensor Eigenpairs. We formally introduce the notion of a
symmetric tensor, which is invariant under any permutation of its indices. Let $\mathbb{R}^{[m,n]}$ be the set
of real-valued order-$m$ tensors where each mode has dimension $n$.

Definition 2.1 (Symmetric tensor [1]). A tensor $\mathcal{A} \in \mathbb{R}^{[m,n]}$ is symmetric if

    $a_{i_{\pi(1)} \cdots i_{\pi(m)}} = a_{i_1 \cdots i_m}$ for all $i_1, \ldots, i_m \in \{1, \ldots, n\}$ and $\pi \in \Pi_m$,

where $\Pi_m$ is the set of permutations of the set $\{1, \ldots, m\}$.

The main computational kernels in the shifted symmetric higher-order power method
will be instances of the following definition of symmetric tensor-vector multiply.

Definition 2.2 (Symmetric tensor-vector multiply [3]). Let $\mathcal{A} \in \mathbb{R}^{[m,n]}$ be symmetric and
$x \in \mathbb{R}^n$. Then for $0 \le p \le m - 1$, the $(m - p)$-times product of the tensor $\mathcal{A}$ with the vector $x$
is denoted by $\mathcal{A}x^{m-p} \in \mathbb{R}^{[p,n]}$ and defined by

    $(\mathcal{A}x^{m-p})_{i_1 \cdots i_p} = \sum_{i_{p+1}, \ldots, i_m} a_{i_1 \cdots i_m} x_{i_{p+1}} \cdots x_{i_m}$ for all $1 \le i_1, \ldots, i_p \le n$.    (2.1)

Note that there is ambiguity in defining a tensor times the same vector in some subset of
modes, but due to symmetry the choice of indexing above yields the same result as any
other valid definition. Also note that the result of a symmetric tensor-vector multiply is also
a symmetric tensor, because any permutation of the indices of the result tensor $(i_1 \cdots i_p)$
on the left hand side of Equation (2.1) corresponds to a permutation of the first $p$ indices of
the symmetric input tensor entries in the summation on the right hand side, which remains
invariant.
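As a concrete reading of Equation (2.1), the sketch below evaluates the product by brute force on a small dense tensor. It is only an illustration: the dictionary-based storage, the 0-based indices, and the function name `ttv` are assumptions of this sketch and say nothing about the storage format or kernels used in the implementation described later.

```python
from itertools import product
from math import prod


def ttv(A, x, m, p):
    """Brute-force A x^{m-p}: contract the last m - p modes of A with x.

    A maps 0-based index tuples of length m to values. The result is a dict
    keyed by the remaining p indices (the single key () when p == 0).
    """
    n = len(x)
    out = {}
    for idx in product(range(n), repeat=m):
        head, tail = idx[:p], idx[p:]
        out[head] = out.get(head, 0.0) + A[idx] * prod(x[i] for i in tail)
    return out


# Toy symmetric tensor with m = 3 and n = 2, built from one value per
# nondecreasing index so that it is invariant under index permutations.
values = {(0, 0, 0): 1.0, (0, 0, 1): 2.0, (0, 1, 1): -1.0, (1, 1, 1): 0.5}
A = {idx: values[tuple(sorted(idx))] for idx in product(range(2), repeat=3)}
x = [0.5, -2.0]

scalar = ttv(A, x, 3, 0)[()]  # A x^3
vector = ttv(A, x, 3, 1)      # A x^2, keyed by (i,)
matrix = ttv(A, x, 3, 2)      # A x^1, keyed by (i, j)
print(scalar, vector, matrix[(0, 1)] == matrix[(1, 0)])
```

The last printed value checks, on this example, the remark that the result of a symmetric tensor-vector multiply is itself symmetric.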

We  recall  the  definition  of  a  tensor  eigenpair  used  in  [3].  There  are  other  definitions  of 
eigenvalues  and  eigenvectors  in  the  literature,  but  the  relationships  between  the  definitions 
and  the  many  interesting  properties  of  tensor  eigenvalues  are  beyond  the  scope  of  this  work. 

Definition 2.3 (Symmetric tensor eigenpair [3]). Assume that $\mathcal{A}$ is a symmetric $m$th-order
$n$-dimensional real-valued tensor. Then $\lambda \in \mathbb{C}$ is an eigenvalue of $\mathcal{A}$ if there exists $x \in \mathbb{C}^n$
such that

    $\mathcal{A}x^{m-1} = \lambda x$ and $\|x\|_2 = 1$.    (2.2)

The vector $x$ is the corresponding eigenvector, and $(\lambda, x)$ is called an eigenpair.

Finally, we present the shifted symmetric higher-order power method (SS-HOPM) from
[3] as Algorithm 1. This algorithm is a generalization of the matrix power method where the
operation $\mathcal{A}x^{m-1}$ generalizes the matrix-vector product and $\mathcal{A}x^m$ generalizes the Rayleigh
quotient for a unit vector. Algorithm 1 includes the shift parameter $\alpha$, which is chosen to force
the underlying function to be convex ($\alpha \ge 0$) or concave ($\alpha < 0$).

The symmetric higher-order power method (with no shift) was introduced in [2, 4], and
convergence of the method was proved for certain types of tensors. While the symmetric
higher-order power method does not converge in general, choosing a sufficiently large (in
absolute value) shift $\alpha$ guarantees convergence of SS-HOPM. The convergence properties of
a given eigenpair are characterized in [3], but there are still many open problems regarding
choice of starting vector, choice of shift, and finding eigenpairs with certain properties.

3.  Exploiting  Symmetry. 

3.1. Symmetric Tensor Storage. Let $\mathcal{A} \in \mathbb{R}^{[m,n]}$ be a symmetric tensor. In general, $\mathcal{A}$
has $n^m$ entries, but since it is symmetric, many of the entry values are repeated and need not
be stored redundantly. We define an index as a number $i \in \{1, \ldots, n\}$, we define a tensor index
as an array of $m$ indices corresponding to one entry of the tensor, and we define an index class
as a set of tensor indices such that the corresponding tensor entries all share a value due to
symmetry. For example, for $m = 3$ and $n = 2$, the possible indices are 1 and 2, and the tensor
indices [1, 1, 2] and [1, 2, 1] are in the same index class since $a_{112} = a_{121}$.

Algorithm 1 Shifted Symmetric Higher-Order Power Method (SS-HOPM) [3]
Given a tensor $\mathcal{A} \in \mathbb{R}^{[m,n]}$.
Require: $x_0 \in \mathbb{R}^n$ with $\|x_0\| = 1$. Let $\lambda_0 = \mathcal{A}x_0^m$.
1: for k = 0, 1, ... do
2:   if $\alpha \ge 0$ then
3:     $\hat{x}_{k+1} \leftarrow \mathcal{A}x_k^{m-1} + \alpha x_k$
4:   else
5:     $\hat{x}_{k+1} \leftarrow -(\mathcal{A}x_k^{m-1} + \alpha x_k)$
6:   end if
7:   $x_{k+1} \leftarrow \hat{x}_{k+1} / \|\hat{x}_{k+1}\|$
8:   $\lambda_{k+1} \leftarrow \mathcal{A}x_{k+1}^m$
9: end for
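A direct, unoptimized transcription of Algorithm 1 may help fix ideas. This is a minimal sketch under stated assumptions: the tensor is held densely in a dictionary keyed by 0-based index tuples, symmetry is not exploited, and a fixed iteration count stands in for a stopping test, which Algorithm 1 leaves unspecified. The names `ax_scalar`, `ax_vector`, and `ss_hopm` are hypothetical.

```python
from itertools import product
from math import prod, sqrt


def ax_scalar(A, x, m):
    """A x^m by brute force over all tensor indices."""
    return sum(A[idx] * prod(x[i] for i in idx)
               for idx in product(range(len(x)), repeat=m))


def ax_vector(A, x, m):
    """A x^{m-1} by brute force: mode 1 is kept, the others meet x."""
    out = [0.0] * len(x)
    for idx in product(range(len(x)), repeat=m):
        out[idx[0]] += A[idx] * prod(x[i] for i in idx[1:])
    return out


def ss_hopm(A, x0, m, alpha, iters=200):
    """Run a fixed number of SS-HOPM iterations and return (lambda, x)."""
    norm = sqrt(sum(v * v for v in x0))
    x = [v / norm for v in x0]
    lam = ax_scalar(A, x, m)
    for _ in range(iters):
        y = [yi + alpha * xi for yi, xi in zip(ax_vector(A, x, m), x)]
        if alpha < 0:
            y = [-yi for yi in y]  # sign flip for a negative shift
        norm = sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        lam = ax_scalar(A, x, m)
    return lam, x


# Toy example: the rank-one symmetric tensor v (x) v (x) v with ||v|| = 1
# has the eigenpair (1, v), since A v^2 = (v . v)^2 v = v.
v = [0.6, 0.8]
A = {idx: prod(v[i] for i in idx) for idx in product(range(2), repeat=3)}
lam, x = ss_hopm(A, [1.0, 0.0], m=3, alpha=1.0)
print(lam, x)
```

For $m = 2$ the same routine reduces to the shifted matrix power method, which gives a quick sanity check on a small symmetric matrix.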


We can find a unique representative of an index class by choosing the tensor index whose
indices are in nondecreasing order. We define this nondecreasing tensor index as the index
representation of the index class.

The index classes of $\mathcal{A}$ can also be characterized by the number of occurrences of each
index $i \in \{1, \ldots, n\}$ in the tensor indices of the index class. Thus, we can define the monomial
representation of an index class as an array of $n$ integers where the $i$th entry in the array
corresponds to the number of occurrences of the index $i$ in the index class. Following the example
given above, the index class that includes [1, 1, 2] and [1, 2, 1] has monomial representation
[2, 1] since there are two 1's and one 2 in every tensor index in the class.

In order to avoid redundant storage, we store only the unique values of the tensor (i.e.,
one value per index class). The following property gives the number of unique values of a
dense symmetric tensor.

Property 3.1. The number of unique values of a symmetric tensor $\mathcal{A} \in \mathbb{R}^{[m,n]}$ is given by
the binomial coefficient

    $\binom{m + n - 1}{m} = \frac{n^m}{m!} + O(n^{m-1})$.

Proof. Each index class corresponds to a unique value. Counting the number of possible
monomial representations of length $m$ with $n$ possible values is equivalent to counting the
number of ways to distribute $m$ indistinguishable balls into $n$ distinguishable buckets, where
the balls correspond to the indices of the tensor index and the buckets correspond to the
possible index values. By a “stars and bars” argument,¹ this number is

    $\binom{m + n - 1}{m} = \frac{(n + m - 1) \cdots (n + 1)\, n}{m!} = \frac{n^m}{m!} + O(n^{m-1})$

as claimed. □
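Property 3.1 is easy to check numerically for small cases. The sketch below enumerates the index classes as nondecreasing tensor indices and compares their number with the binomial coefficient and with the $n^m$ entries of the full tensor. The function names are hypothetical and the enumeration is for illustration only.

```python
from itertools import combinations_with_replacement
from math import comb


def index_classes(m, n):
    """All nondecreasing tensor indices of length m over {1, ..., n}."""
    return list(combinations_with_replacement(range(1, n + 1), m))


def num_unique_values(m, n):
    """Number of unique values of a symmetric tensor by Property 3.1."""
    return comb(m + n - 1, m)


for m, n in [(3, 2), (3, 4), (4, 3), (4, 10)]:
    classes = index_classes(m, n)
    print(m, n, len(classes), num_unique_values(m, n), n ** m)
```

For $m = 3$ and $n = 4$ this gives 20 unique values out of 64 entries, and the saving approaches the factor $m!$ as $n$ grows.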

¹See Theorem 2 in Section 4.6 of [9], for example.

Assuming $\mathcal{A}$ is dense, we can impose an ordering on the unique entries and avoid storing
any index information. We choose to use a lexicographic order of the index classes, increasing
with respect to the index representation and decreasing with respect to the monomial
representation. That is, the index class with index representation $[i_1, i_2, \ldots, i_m]$ is listed
before $[j_1, j_2, \ldots, j_m]$ if $i_1 < j_1$ or if $i_1 = j_1$ and $i_2 < j_2$, and so on. Equivalently, the index
class with monomial representation $[k_1, k_2, \ldots, k_n]$ is listed before $[l_1, l_2, \ldots, l_n]$ if $k_1 > l_1$ or
if $k_1 = l_1$ and $k_2 > l_2$, and so on. This corresponds to an ordering on monomials in a given
polynomial ring (the origin of the terminology). In this case, the index classes correspond
to monomials which all have total degree $m$. See Table 3.1 for an example of lexicographic
ordering for both representations in the case $m = 3$ and $n = 4$.

index   index representation   monomial representation
    1   1 1 1                  3 0 0 0
    2   1 1 2                  2 1 0 0
    3   1 1 3                  2 0 1 0
    4   1 1 4                  2 0 0 1
    5   1 2 2                  1 2 0 0
    6   1 2 3                  1 1 1 0
    7   1 2 4                  1 1 0 1
    8   1 3 3                  1 0 2 0
    9   1 3 4                  1 0 1 1
   10   1 4 4                  1 0 0 2
   11   2 2 2                  0 3 0 0
   12   2 2 3                  0 2 1 0
   13   2 2 4                  0 2 0 1
   14   2 3 3                  0 1 2 0
   15   2 3 4                  0 1 1 1
   16   2 4 4                  0 1 0 2
   17   3 3 3                  0 0 3 0
   18   3 3 4                  0 0 2 1
   19   3 4 4                  0 0 1 2
   20   4 4 4                  0 0 0 3

Table 3.1. Set of index classes for $m = 3$ and $n = 4$ in lexicographic order.
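The ordering used in Table 3.1 can be generated programmatically. In the sketch below, which uses hypothetical function names, Python's `itertools.combinations_with_replacement` emits nondecreasing tuples in lexicographic order, which coincides with the increasing order on index representations. The monomial representation is then obtained by counting occurrences.

```python
from itertools import combinations_with_replacement


def monomial_representation(index_rep, n):
    """Occurrence count of each index 1, ..., n in an index representation."""
    return [index_rep.count(i) for i in range(1, n + 1)]


def ordered_index_classes(m, n):
    """List (position, index representation, monomial representation)."""
    return [(pos, list(idx), monomial_representation(idx, n))
            for pos, idx in enumerate(
                combinations_with_replacement(range(1, n + 1), m), start=1)]


table = ordered_index_classes(3, 4)
for pos, index_rep, monomial_rep in table:
    print(pos, index_rep, monomial_rep)
```

Comparing consecutive monomial representations confirms that the same list is decreasing in that representation, as stated in the text.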

While the lexicographic ordering makes storing index information for every unique value
unnecessary, it will be important to compute index information during computations. Since
the index representation requires $m$ integers and the monomial representation requires $n$
integers and we expect $n \gg m$ for most problems, we store the index representation and compute
monomial representation values implicitly. Note that while the monomial representation will
be sparse when $n \gg m$, even a compressed format would require at least $m$ integers.

3.2. Computational Kernels. The two most computationally intensive kernels in Algorithm 1
are computing the scalar $\mathcal{A}x^m$ and the vector $\mathcal{A}x^{m-1}$, where
$\mathcal{A} \in \mathbb{R}^{[m,n]}$ is symmetric and $x \in \mathbb{R}^n$. Both of these are
instances of the symmetric tensor-vector multiply given in Definition 2.2, with $p = 0$ and
$p = 1$, respectively.

3.2.1. Tensor times same vector in all modes. Consider the case $p = 0$:

$$\mathcal{A}x^m = \sum_{i_1=1}^{n} \cdots \sum_{i_m=1}^{n} a_{i_1 \cdots i_m}\, x_{i_1} \cdots x_{i_m}. \tag{3.1}$$

For a nonsymmetric tensor, this summation requires at least one multiplication for each term
(corresponding to each entry of $\mathcal{A}$), yielding at least $n^m$ flops. However, we can
exploit symmetry to reduce the computational complexity. Note that the tensor index matches
the indices of the $x$ vector entries for each term in the summation. Since the product of a
set of numbers is also invariant under permutation, all of the terms in the summation
corresponding to the same index class will have the same value.

For example, for $m = 3$ and $n = 2$, the term in the summation corresponding to the tensor
index $[1, 1, 2]$ is given by $a_{112} \cdot x_1 \cdot x_1 \cdot x_2 = a_{112}\, x_1^2 x_2$, and the
term in the summation corresponding to the tensor index $[1, 2, 1]$ is given by
$a_{121} \cdot x_1 \cdot x_2 \cdot x_1 = a_{121}\, x_1^2 x_2$. Since $a_{112} = a_{121}$ by symmetry,
any tensor index with monomial representation $[2, 1]$ will yield this value.

We  can  avoid  recomputing  the  redundant  value  by  instead  computing  the  number  of  times 
each  unique  term  appears  in  the  summation,  which  is  given  by  the  following  property. 

Property 3.2. The number of tensor indices of a symmetric tensor
$\mathcal{A} \in \mathbb{R}^{[m,n]}$ in the index class with monomial representation
$[k_1, k_2, \ldots, k_n]$ is given by the multinomial coefficient

$$\binom{m}{k_1, k_2, \ldots, k_n} = \frac{m!}{k_1!\, k_2! \cdots k_n!}.$$

Proof. Consider the monomial representation $[k_1, k_2, \ldots, k_n]$. Counting the number of
tensor indices in this class is equivalent to counting the number of ways one can distribute
$m$ distinct balls into $n$ distinct bins such that the $i$th bin has $k_i$ balls. Here the balls
correspond to the (ordered) indices of the tensor index and the bins correspond to the possible
index values. One way to solve this problem is to count the number of ways of filling the first
bin (given by the binomial coefficient $\binom{m}{k_1}$), followed by the number of ways of
filling the second bin (given by $\binom{m-k_1}{k_2}$), and so on. Using the product rule and
after much cancellation, we have

$$\binom{m}{k_1} \binom{m-k_1}{k_2} \cdots \binom{m-(k_1+k_2+\cdots+k_{n-1})}{k_n} = \frac{m!}{k_1!\, k_2! \cdots k_n!}$$

as claimed. □
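Property 3.2 is easy to check by brute force for small cases. The sketch below is illustrative only (the function names are hypothetical): it enumerates all $n^m$ tensor indices and counts those that fall into a given index class.

```python
from itertools import product
from math import factorial

def multinomial(ks):
    """m! / (k_1! ... k_n!) with m = sum(ks)."""
    result = factorial(sum(ks))
    for k in ks:
        result //= factorial(k)   # every intermediate quotient is an integer
    return result

def count_tensor_indices(ks):
    """Count tensor indices in {1..n}^m whose monomial representation is ks."""
    n, m = len(ks), sum(ks)
    return sum(
        1
        for idx in product(range(1, n + 1), repeat=m)
        if [idx.count(v) for v in range(1, n + 1)] == list(ks)
    )

for ks in ([2, 1], [1, 1, 1, 0], [0, 0, 1, 2]):
    print(ks, count_tensor_indices(ks), multinomial(ks))
```

For the class $[2, 1]$ from the example above, both counts give the three tensor indices $[1,1,2]$, $[1,2,1]$, and $[2,1,1]$.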

We can thus rewrite Equation 3.1 as

$$\mathcal{A}x^m = \sum_{I \in \mathcal{I}^{[m,n]}} \binom{m}{k_1, \ldots, k_n}\, a_{i_1 \cdots i_m}\, x_1^{k_1} \cdots x_n^{k_n}, \tag{3.2}$$

where $\mathcal{I}^{[m,n]}$ is the set of index classes for a symmetric tensor in
$\mathbb{R}^{[m,n]}$, and $[k_1, \ldots, k_n]$ and $[i_1, \ldots, i_m]$ are the monomial and index
representations of the index class $I$, respectively. Equation 3.2 yields Algorithm 2, which
assumes the unique values of $\mathcal{A}$ are stored in lexicographic order. For each unique
value, the algorithm computes the multinomial coefficient and index array associated with the
tensor entry and adds the contribution of that term to the accumulating result.
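A minimal Python sketch of this computation follows, assuming the unique entries are supplied as a list `A_unique` in lexicographic order. The names `sym_tensor_vec0` and `dense_tensor_vec0` are hypothetical. For brevity the sketch lets `itertools.combinations_with_replacement` stand in for the index update and `collections.Counter` for the occurrence count; it is not the implementation described in this paper.

```python
from collections import Counter
from itertools import combinations_with_replacement, product
from math import factorial, prod

def multinomial(I):
    """m! / (k_1! ... k_n!) for the index representation I."""
    c = factorial(len(I))
    for k in Counter(I).values():
        c //= factorial(k)
    return c

def sym_tensor_vec0(A_unique, x, m):
    """Scalar A x^m using one term per index class (Equation 3.2)."""
    n = len(x)
    classes = combinations_with_replacement(range(1, n + 1), m)
    return sum(
        multinomial(I) * a * prod(x[i - 1] for i in I)
        for a, I in zip(A_unique, classes)
    )

def dense_tensor_vec0(A_unique, x, m):
    """Reference: the full n^m-term sum of Equation 3.1."""
    n = len(x)
    classes = combinations_with_replacement(range(1, n + 1), m)
    entry = {I: a for a, I in zip(A_unique, classes)}
    return sum(
        entry[tuple(sorted(idx))] * prod(x[i - 1] for i in idx)
        for idx in product(range(1, n + 1), repeat=m)
    )

A_unique = [1.0, 2.0, 3.0, 4.0]   # a111, a112, a122, a222 for m = 3, n = 2
x = [2.0, 3.0]
print(sym_tensor_vec0(A_unique, x, 3), dense_tensor_vec0(A_unique, x, 3))
```

Both functions return the same value; the symmetric version touches 4 unique entries where the dense reference sums $2^3 = 8$ terms.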

3.2.2. Tensor times same vector in all modes but one. Now consider computing the vector
$\mathcal{A}x^{m-1}$, the case $p = 1$ in Definition 2.2:

$$\left(\mathcal{A}x^{m-1}\right)_{i_1} = \sum_{i_2=1}^{n} \cdots \sum_{i_m=1}^{n} a_{i_1 i_2 \cdots i_m}\, x_{i_2} \cdots x_{i_m}. \tag{3.3}$$

Note that the $j$th component of $\mathcal{A}x^{m-1}$ does not depend on every tensor entry, only
those tensor entries whose index representation starts with index $j$. Because of symmetry,
Equation 3.3 can be rewritten with $i_1$ appearing as any index in the tensor index of the
tensor value.

As in the case of computing $\mathcal{A}x^m$, we can exploit symmetry to avoid performing the
more than $n^m$ multiplications required to compute all entries of the output vector if we
followed Equation 3.3. As before, if a tensor value contributes to the summation for index $k$
of the output vector, its symmetric counterparts will contribute the same value to the sum.
Following the example given before, where $m = 3$ and $n = 2$, both $a_{112}$ and $a_{121}$ will
contribute to the computation of $\left(\mathcal{A}x^{m-1}\right)_1$, and each will contribute the
value $a_{112} \cdot x_1 \cdot x_2$. Note that $a_{211}$ will not contribute to the summation for
$\left(\mathcal{A}x^{m-1}\right)_1$ because its first index is not 1.

Algorithm 2 Compute $y = \mathcal{A}x^m$ via Equation 3.2, where
$\mathcal{A} \in \mathbb{R}^{[m,n]}$ is symmetric, $x \in \mathbb{R}^n$, and $y \in \mathbb{R}$

Require: A stores the unique entries of $\mathcal{A}$ in lexicographic order
    function y = SymmetricTensorVectorMultiply0(A, x)
        $y = 0$
        $I = [1, \ldots, 1]$                           ▷ use index representation (length m)
        for $j = 1$ to $\binom{m+n-1}{m}$ do           ▷ iterate over unique entries
            $\chi = x_{I_1} \cdot x_{I_2} \cdots x_{I_m}$    ▷ compute monomial value
            $c$ = NumOcc0($I$)                         ▷ compute number of occurrences
            $y = y + c \cdot A_j \cdot \chi$           ▷ accumulate sum
            $I$ = UpdateIndex($I$)                     ▷ see Algorithm 4
        end for
    end function

Require: I has length m with entries in nondecreasing order
    function c = NumOcc0(I)
        div = 1                                        ▷ divisor of $m!$ in $\binom{m}{k_1, \ldots, k_n}$
        curr = -1                                      ▷ current index value
        mult = -1                                      ▷ multiplicity of current index value
        for $j = 1$ to $m$ do
            if $I_j \neq$ curr then
                mult = 1
                curr = $I_j$
            else                                       ▷ repeated index
                mult = mult + 1
                div = div · mult                       ▷ only update divisor if mult > 1
            end if
        end for
        $c = m!$ / div
    end function

Computing the number of tensor indices in an index class that will contribute to a given entry
of the output vector is a variation on Property 3.2. Consider an index class that contributes
to the $j$th entry of the output vector (i.e., an index class whose index representation
includes an index $j$). Let $[k_1, k_2, \ldots, k_n]$ be the monomial representation, so that
$k_j > 0$. In the context of assigning $m$ balls to $n$ bins with appropriate multiplicities, we
can assign the first ball to the $j$th bin (enforcing that the tensor index starts with $j$).
Then we have $m - 1$ more balls to assign to the $n$ bins, but only $k_j - 1$ more will be
assigned to the $j$th bin. Using the approach given in the proof of Property 3.2, we see that
the number of tensor indices that will contribute the same value to the $j$th element is given
by the multinomial coefficient

$$\binom{m-1}{k_1, \ldots, k_j - 1, \ldots, k_n}.$$

Now we can rewrite Equation 3.3 as

$$\left(\mathcal{A}x^{m-1}\right)_j = \sum_{\substack{I \in \mathcal{I}^{[m,n]} \\ k_j > 0}} \binom{m-1}{k_1, \ldots, k_j - 1, \ldots, k_n}\, a_{i_1 \cdots i_m}\, x_1^{k_1} \cdots x_j^{k_j - 1} \cdots x_n^{k_n}, \tag{3.4}$$

where $\mathcal{I}^{[m,n]}$ is the set of index classes for a symmetric tensor in
$\mathbb{R}^{[m,n]}$, and $[k_1, \ldots, k_n]$ and $[i_1, \ldots, i_m]$ are the monomial and index
representations of the index class $I$, respectively. Equation 3.4 yields Algorithm 3.


Algorithm 3 Compute $y = \mathcal{A}x^{m-1}$ via Equation 3.4, where
$\mathcal{A} \in \mathbb{R}^{[m,n]}$ is symmetric, and $x, y \in \mathbb{R}^n$

Require: A stores the unique entries of symmetric tensor $\mathcal{A}$ in lexicographic order
    function y = SymmetricTensorVectorMultiply1(A, x)
        $y = 0$
        $I = [1, \ldots, 1]$                           ▷ use index representation (length m)
        for $j = 1$ to $\binom{m+n-1}{m}$ do           ▷ iterate over unique tensor entries
            for unique $i \in I$ do                    ▷ skip repeated indices in I
                $\chi = x_{I_1} \cdot x_{I_2} \cdots x_{I_m} / x_i$    ▷ compute monomial value (excluding $x_i$)
                $c$ = NumOcc1($I$, $i$)                ▷ compute number of occurrences
                $y_i = y_i + c \cdot A_j \cdot \chi$   ▷ accumulate sum
            end for
            $I$ = UpdateIndex($I$)                     ▷ see Algorithm 4
        end for
    end function

Require: I has length m with entries in nondecreasing order
    function c = NumOcc1(I, i)
        div = 1                                        ▷ divisor of $(m-1)!$
        curr = -1                                      ▷ current index value
        mult = -1                                      ▷ multiplicity of current index value
        for $j = 1$ to $m$ do
            if $j \neq$ first index of $i$ in $I$ then ▷ ignore one occurrence of i
                if $I_j \neq$ curr then
                    mult = 1
                    curr = $I_j$
                else                                   ▷ repeated index
                    mult = mult + 1
                    div = div · mult                   ▷ only update divisor if mult > 1
                end if
            end if
        end for
        $c = (m-1)!$ / div                             ▷ set $c = \binom{m-1}{k_1, \ldots, k_i - 1, \ldots, k_n}$
    end function
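The vector kernel can be sketched in Python in the same spirit. The name `sym_tensor_vec1` and the list `A_unique` of unique entries are assumptions for illustration. Unlike the pseudocode, which divides the full monomial by $x_i$, this sketch forms the reduced monomial directly from the reduced multiplicities, which avoids a division by zero when some $x_i = 0$.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod

def sym_tensor_vec1(A_unique, x, m):
    """Vector A x^(m-1) using one pass over the index classes (Equation 3.4)."""
    n = len(x)
    y = [0.0] * n
    classes = combinations_with_replacement(range(1, n + 1), m)
    for a, I in zip(A_unique, classes):
        counts = Counter(I)                # nonzero entries of the monomial representation
        for j in counts:                   # each distinct index value in the class
            reduced = counts.copy()
            reduced[j] -= 1                # the first "ball" is fixed in bin j
            coeff = factorial(m - 1)
            for k in reduced.values():
                coeff //= factorial(k)
            chi = prod(x[i - 1] ** k for i, k in reduced.items())
            y[j - 1] += coeff * a * chi
    return y

A_unique = [1.0, 2.0, 3.0, 4.0]   # a111, a112, a122, a222 for m = 3, n = 2
x = [2.0, 3.0]
y = sym_tensor_vec1(A_unique, x, 3)
print(y, sum(xi * yi for xi, yi in zip(x, y)))
```

A useful consistency check is that $x^T (\mathcal{A}x^{m-1}) = \mathcal{A}x^m$, which the last line evaluates.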

3.2.3. Index array calculations. We can compute the index representation of an index class
quickly by exploiting the lexicographic ordering and computing each index representation from
the previous one. That is, given any index representation we want to compute the next larger
index representation in the lexicographic order, under the conditions that the indices within
the index representation are nondecreasing and range between 1 and $n$.

To find the next representation, we seek to increment the least significant possible index
(i.e., the rightmost index not equal to $n$). In the example given in Table 3.1, the successor
of $[1, 1, 1]$ is $[1, 1, 2]$ (the last index is incremented). More generally, suppose the $k$th
index is the least significant index not equal to $n$, so that the index class is
$[i_1, \ldots, i_k, n, \ldots, n]$.² Thus, this is the largest representation with prefix
$[i_1, \ldots, i_k, \ldots]$, so the successor must have prefix $[i_1, \ldots, i_k + 1, \ldots]$.
The smallest such representation that satisfies the nondecreasing condition is

$$[i_1, \ldots, i_k + 1, i_k + 1, \ldots, i_k + 1].$$

For example, again from Table 3.1, the successor of $[2, 4, 4]$ is $[3, 3, 3]$. See Algorithm 4
for the implementation. In this way, we can store index information in an array of $m$
integers, and under the lexicographic ordering, updating the index information for each term
in the summation requires $O(m)$ operations.


Algorithm 4 Update index representation of unique entry in symmetric tensor
$\mathcal{A} \in \mathbb{R}^{[m,n]}$

Require: I has length m with entries in nondecreasing order
    function UpdateIndex(I)
        $j = m$
        while $I_j == n$ do                            ▷ find least significant index ≠ n
            $j = j - 1$
        end while
        $I_j = I_j + 1$                                ▷ increment least significant index ≠ n
        for $k = j + 1$ to $m$ do                      ▷ update less significant indices
            $I_k = I_j$
        end for
    end function
Ensure: I is the successor in lexicographic ordering (restricted to nondecreasing)
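The update rule can be written in Python as follows. This is a sketch: the names are hypothetical, and returning `None` for the final class $[n, \ldots, n]$ is a choice made here, since the pseudocode does not say what happens when no successor exists.

```python
from itertools import combinations_with_replacement

def update_index(I, n):
    """Return the lexicographic successor of the nondecreasing list I,
    or None if I is the last index class [n, ..., n]."""
    I = list(I)
    j = len(I) - 1
    while j >= 0 and I[j] == n:        # find least significant index != n
        j -= 1
    if j < 0:
        return None
    I[j] += 1                          # increment it
    for k in range(j + 1, len(I)):     # reset the less significant indices
        I[k] = I[j]
    return I

def all_classes(m, n):
    """Walk through every index class starting from [1, ..., 1]."""
    I, out = [1] * m, []
    while I is not None:
        out.append(I)
        I = update_index(I, n)
    return out

print(update_index([1, 1, 1], 4), update_index([2, 4, 4], 4))
print(len(all_classes(3, 4)))
```

Each call does $O(m)$ work, and repeated calls visit the classes in the same order as Table 3.1.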

3.2.4. Computing number of occurrences. The number of occurrences of each index class is
given by a multinomial coefficient in terms of the monomial representation of the index class.
Since we store the index representation and not the monomial representation, we compute the
multinomial coefficient implicitly. We can do this by computing the denominator with one pass
over the array storing the index representation. The numerator is constant over all index
classes and can be precomputed (either $m!$ or $(m-1)!$ for the two computational kernels).

In the case of computing $\mathcal{A}x^m$, the task is to compute for each index class the
product $k_1! \cdots k_n!$, where $[k_1, \ldots, k_n]$ is the monomial representation, which is
not stored explicitly. Note that $k_i$ is the number of occurrences of index $i$ in the index
representation, which is stored in memory. Since the index representation is nondecreasing,
repeated occurrences of an index will be contiguous. Thus, as we pass over the index array, we
can multiply the accumulated product by 1 for the first occurrence of an index, by 2 for the
second occurrence, and so on. For example, given the index representation
$[1, 2, 2, 5, 5, 5, 5]$, the accumulated product will be
$1 \cdot 1 \cdot 2 \cdot 1 \cdot 2 \cdot 3 \cdot 4 = 1! \cdot 2! \cdot 4!$. This approach yields
the function NumOcc0 in Algorithm 2.

In the case of computing $\mathcal{A}x^{m-1}$, we take the same approach to compute the
denominator, but we ignore one occurrence of the index corresponding to the entry of the
output vector being computed. Following the preceding example, in the case of computing the
5th element of $\mathcal{A}x^{m-1}$, the index representation $[1, 2, 2, 5, 5, 5, 5]$ would
yield the accumulated product $1 \cdot 1 \cdot 2 \cdot 1 \cdot 2 \cdot 3 = 1! \cdot 2! \cdot 3!$.
This approach yields the function NumOcc1 in Algorithm 3.
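The one-pass computation of both coefficients can be sketched as below. The Python names mirror NumOcc0 and NumOcc1 but are otherwise hypothetical, and `None` replaces the sentinel value $-1$ used in the pseudocode.

```python
from math import factorial

def num_occ0(I):
    """m! divided by the product of factorials of the index multiplicities."""
    div, curr, mult = 1, None, 0
    for v in I:
        if v != curr:          # first occurrence of a new index value
            curr, mult = v, 1
        else:                  # repeated index: extend the running factorial
            mult += 1
            div *= mult
    return factorial(len(I)) // div

def num_occ1(I, i):
    """Same pass, but one occurrence of index i is ignored and (m-1)! is used."""
    skip = I.index(i)          # position of the first occurrence of i
    div, curr, mult = 1, None, 0
    for pos, v in enumerate(I):
        if pos == skip:
            continue
        if v != curr:
            curr, mult = v, 1
        else:
            mult += 1
            div *= mult
    return factorial(len(I) - 1) // div

I = [1, 2, 2, 5, 5, 5, 5]
print(num_occ0(I), [num_occ1(I, i) for i in (1, 2, 5)])
```

For the example in the text this gives $7!/(1!\,2!\,4!) = 105$ for the scalar kernel and $6!/(1!\,2!\,3!) = 60$ for the 5th element of the vector kernel.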

²Note that there may be no instances of index $n$ in the index class, in which case $k = m$,
the index class is $[i_1, \ldots, i_m]$, and the successor is $[i_1, \ldots, i_m + 1]$.
In order to avoid redundant computation (at the expense of extra storage), we can precompute
the multinomial coefficient $\binom{m}{k_1, \ldots, k_n}$ for each index class. This is the
coefficient used in the computation of $\mathcal{A}x^m$, and the coefficients needed in the
computation of $\mathcal{A}x^{m-1}$ can be obtained by dividing the stored value by $m$ and
multiplying by $k_j$ for appropriate $j$.
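The conversion rests on the identity $\binom{m-1}{k_1, \ldots, k_j - 1, \ldots, k_n} = \frac{k_j}{m} \binom{m}{k_1, \ldots, k_n}$. The short check below is illustrative only, with hypothetical names; it derives the vector-kernel coefficients from one stored value and compares them with a direct evaluation.

```python
from collections import Counter
from math import factorial

def multinomial(ks):
    """(sum of ks)! divided by the product of k! over ks."""
    ks = list(ks)
    c = factorial(sum(ks))
    for k in ks:
        c //= factorial(k)
    return c

I = [1, 2, 2, 5, 5, 5, 5]
m = len(I)
counts = Counter(I)                               # k_j for each index value j present
stored = multinomial(counts.values())             # coefficient used for A x^m

# Derived coefficients: multiply by k_j, then divide by m (the quotient is exact).
derived = {j: stored * k // m for j, k in counts.items()}

# Direct evaluation with k_j reduced by one, for comparison.
direct = {}
for j in counts:
    reduced = counts.copy()
    reduced[j] -= 1
    direct[j] = multinomial(reduced.values())

print(stored, derived, direct)
```

Multiplying before dividing keeps the arithmetic in integers, since the product `stored * k` is always divisible by `m`.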

3.2.5. Computational costs. All the computations in the main loop of Algorithm 2 are done in
$O(m)$ operations (floating point and otherwise). Thus, the computational complexity of
computing $\mathcal{A}x^m$ is
$O\left(m \cdot \binom{m+n-1}{m}\right) = O\left(\frac{n^m}{(m-1)!}\right)$.

There are nested loops in Algorithm 3, and the inner loop requires $m$ iterations in the worst
case. All the computations in the inner loop are done in $O(m)$ operations (floating point and
otherwise), so the computational complexity of computing $\mathcal{A}x^{m-1}$ is
$O\left(m^2 \cdot \binom{m+n-1}{m}\right) = O\left(\frac{n^m}{(m-2)!}\right)$.
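To make these counts concrete, the following sketch compares the number of unique entries $\binom{m+n-1}{m}$ with the $n^m$ entries of the full tensor. The function names are hypothetical, and the work estimates simply multiply the loop count by $m$ or $m^2$ as in the argument above.

```python
from math import comb, factorial

def unique_entries(m, n):
    """Number of index classes, i.e. unique entries of a symmetric tensor."""
    return comb(m + n - 1, m)

def loop_work(m, n):
    """Rough operation counts: m per class (Algorithm 2), m^2 per class (Algorithm 3)."""
    u = unique_entries(m, n)
    return m * u, m * m * u

for m, n in [(3, 4), (4, 3), (4, 100)]:
    u = unique_entries(m, n)
    # The last column is the ratio of u to n^m / m!, which tends to 1 as n grows.
    print(m, n, u, n ** m, u * factorial(m) / n ** m)
```

For fixed $m$ the number of unique entries approaches $n^m / m!$ as $n$ grows, which is where the factorial in the bounds comes from.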


4. GPU Computing Overview. Graphics processing units (GPUs) were originally developed and
optimized to offload and accelerate graphics rendering computations from the more general
purpose microprocessor or “central processing unit” (CPU) on a host computer. Graphics
processing consists largely of data parallel computations, and GPU hardware is designed to
exploit this data parallelism via single instruction/multiple data (SIMD) instructions. GPUs
also exploit instruction level parallelism: instruction streams for several threads of
execution are pipelined in order to hide the latency of memory operations for each thread
(this requires that the threads be mutually independent).

GPU architecture is rapidly developing to meet the demands of new applications and users.
Many of these applications require high graphics rendering performance, but a growing number
of users are interested in exploiting the computing power of GPUs for many other purposes,
including scientific computing. To this end, nVidia has invested in the development of
Compute Unified Device Architecture (CUDA), which is used for general purpose programming of
GPUs. Most programmers use CUDA as an extension of the C language which gives access to a set
of virtual instructions for accessing the memory spaces and functional units on a GPU.

Along with making CUDA freely available, nVidia also offers a software development kit
including programming guides, example programs, and other documentation for programmers.
Much of the information in the following sections is available in more detail in the CUDA
documentation, particularly [5, 6].

4.1. Physical Hardware Model. Both the computational units and the memory hierarchy on GPUs
are fundamentally different from CPU architectures. See Figure 4.1 for a graphical
representation of the physical hardware model.

Computational Units. The functional units on a GPU are organized into groups which
concurrently execute SIMD instructions. In nVidia terminology, each functional unit is known
as a “processor” or “core”, and each group of processors resides on a “streaming
multiprocessor.” On the GeForce 9800 GT used in our experiments, there are 14
multiprocessors (see Figure 4.1(a)), each with 8 processors (see Figure 4.1(b)). Thus, on each
multiprocessor, eight processors can simultaneously execute the same instruction on different
data. The GPU we used is capable of only single precision floating point operations, but
newer models can execute double precision operations.

Memory Hierarchy. GPUs have a complicated memory hierarchy with several different physical
and logical memory spaces. Note that the memory hierarchy discussed here is only
representative of nVidia GPUs of Compute Capability 1.x; newer architectures of Compute
Capability 2.x have fundamental differences. The largest memory is known as “device memory”
and is accessible to all multiprocessors on the GPU (see Figure 4.1(a)). It is also accessible
from the host device (CPU) and is the means through which the CPU and GPU communicate data.
Except for “integrated” cards, this memory resides on the graphics card itself. The memory
access latency for device memory to one of the GPU’s computational units is two orders of
magnitude greater than the latency of the on-chip memory.

(a) GPU card with device memory and set of streaming multiprocessors (SM). Memory on each SM
shown in (b).
(b) Streaming multiprocessor with on-chip memories and SIMD functional units (P). Each SM has
access to device memory as shown in (a).

Fig. 4.1. GPU Hardware Model

There are four types of on-chip memory: registers, shared memory, constant cache, and texture
cache (see Figure 4.1(b)). The set of registers, or “register file,” is relatively large but
must be divided up among all threads resident on the multiprocessor; it has the smallest
memory access latency (one or two cycles). The shared memory is the next fastest memory. It is
smaller than the register file but can be shared among threads in a thread block. Shared
memory can be dynamically allocated and can be used as a local store (i.e., there is no
hardware-managed caching system).

Some  of  device  memory  can  be  statically  allocated  as  “constant”  memory,  and  accesses 
to  constant  memory  will  be  cached  by  the  hardware.  Constant  memory  is  read-only  for  a 
given  GPU  kernel  function  but  can  be  written  by  the  host  CPU  between  kernel  calls.  A 
“texture”  can  be  bound  to  an  array  in  device  memory  such  that  the  result  of  a  texture  “fetch” 
will  be  cached.  The  texture  caches  on  a  GPU  are  shared  by  two  or  three  multiprocessors.  The 
texture  caching  system  is  designed  to  exploit  2D  spatial  locality,  and  texture  fetches  include 
other  features  designed  to  improve  the  performance  of  certain  relevant  graphics  operations. 
See  Table  4.1  for  the  sizes  of  the  on-chip  memory  for  the  GeForce  9800  GT  card. 

    Register file      8192 registers
    Shared memory      16 KB
    Texture cache      6-8 KB
    Constant cache     8 KB

Table 4.1
On-chip memory sizes per multiprocessor for GeForce 9800 GT (Compute Capability 1.1)

4.2. CUDA Programming Model. The simplest CUDA programming model treats the GPU as a
coprocessor to the host CPU. That is, a single thread of execution works on the CPU
sequentially until it calls a “kernel” function on the GPU, which is run by many CUDA threads
in parallel. After the kernel returns, the single CPU thread resumes execution until it calls
another kernel or terminates. Multiple CPU threads can be used in order to overlap CPU and GPU
computation, but we only consider one CPU thread in this work. Kernel functions may call other
functions to be run on the GPU (which will also run in parallel); these other functions cannot
be called from host code. When a kernel function is launched from the host code, the host
specifies the number of thread blocks, the number of threads per block, and optionally the
amount of shared memory to allocate to each thread block (all of which can be determined at
run time).

Thread  blocks  are  groups  of  threads  which  are  all  run  on  the  same  multiprocessor.  They 
have  a  common  memory  space  residing  in  the  physical  shared  memory  through  which  the 
threads  can  communicate  and  synchronize.  Thread  blocks  are  logical  entities  and  the  number 
of  threads  per  block  is  unrestricted  up  to  a  certain  maximum;  however,  threads  are  physically 
grouped  into  warps  (the  physical  unit  of  SIMD  instructions)  during  execution,  so  the  number 
of  threads  per  block  should  be  a  multiple  of  the  warp  size  (typically  32). 

The logical memory hierarchy is tightly coupled to the physical memory. Registers are local to
threads, shared memory is restricted to threads within a thread block, and global memory
(which resides in “device” memory) is accessible by all threads and by the host code.
Communication between thread blocks using global memory is possible but rare because thread
blocks may be scheduled on any multiprocessor in any order. Textures and constant memory are
also globally accessible and are read-only; textures are accessed via special texture fetches.
Another memory space known as “local” memory is logically local to each thread, but the name
is misleading because local memory physically resides in device memory. In general, local
memory is used to handle register spilling.

5. Detecting Nerve Fiber Direction in the Brain. We next discuss an application well-suited
for computation on a GPU. It involves many independent problems that can be solved in
parallel, and each problem involves an amount of data that is small enough to reside in the
on-chip memories of the multiprocessors.

Diffusion-weighted magnetic resonance imaging (DW-MRI) is a tool used to detect nerve fibers
in the brain. It is a non-invasive procedure that uses magnetic resonance to measure how
quickly water diffuses in a certain direction. Water diffuses more quickly along the
longitudinal axis of nerve fiber bundles than in any transverse or axial direction. DW-MRI
measurements are taken from many different orientations for a discrete set of voxels in the
brain. For each voxel, a diffusion function $D : \Sigma \to \mathbb{R}$, which maps an
orientation to its rate of diffusion (here $\Sigma$ denotes the unit sphere in
$\mathbb{R}^3$), is approximated using the measurement data. For a unit vector $g$, $D(g)$ is
known as the “apparent diffusion coefficient” (ADC) [10].

When a voxel includes only one fiber orientation, the longitudinal direction should (globally)
maximize $D$ (it will exhibit the largest ADC). When a voxel includes more than one fiber
orientation (in the case of crossing fibers), each fiber orientation should correspond to a
local maximum of $D$.

According to [7, 8, 10], a common way to approximate the diffusion function is as a finite sum
of spherical harmonic functions (which form a basis for complex functions on the unit
sphere). The 2nd order series (with 6 terms) corresponds to a quadratic form

$$D(g) \approx g^T M g,$$

where $M$ is a symmetric positive definite $3 \times 3$ matrix. In this case, at least six
measurements are required to determine the unique entries in the matrix $M$ (or the six
coefficients of the first spherical harmonic functions). In the case of a voxel with one
principal fiber orientation, this approach is usually sufficient for resolving the correct
orientation. However, in the case of fiber crossings or other complications such as bending or
fanning fiber bundles, the approximation is often unable to resolve the fiber directions.

In order to handle such cases, more accurate measurements and approximations are necessary.
The approach is to use higher order spherical harmonic series approximations, which can be
represented not as quadratic forms but more generally as homogeneous forms. The homogeneous
forms correspond to higher order tensors:

$$D(g) \approx \mathcal{A}g^m$$

for some symmetric tensor $\mathcal{A} \in \mathbb{R}^{[m,3]}$. Note that $m$ must be even since
$D(g)$ is a positive physical quantity for all $g$ (if $m$ is odd then
$\mathcal{A}(-g)^m = -\mathcal{A}g^m$). More DW-MRI measurements are required to determine the
greater degrees of freedom in tensors of order $m > 2$, and the higher order polynomial can
better approximate the true diffusion function. Orders $m = 4$ and $m = 6$ are most commonly
used ($m = 8$ requires 120 measurements). The correspondence between the coefficients of
spherical harmonic functions and the entries in the associated symmetric tensor is given in
[10].

As described in [3], the critical points of the function $f(x) = \mathcal{A}x^m$ and their
function values are exactly the eigenpairs of the tensor $\mathcal{A}$ (satisfying Equation
2.2). Thus, in order to determine the principal fiber orientations in a given voxel, we can
compute the principal eigenvectors of the associated tensor.

Note that specific instances of Properties 3.1 and 3.2 for $n = 3$ appear in the DW-MRI
literature. See Equations 17 and 19 in [7], for example.

6. Implementation Details. The computational problem for the nerve fiber data is to take as
input a three dimensional array of symmetric tensors and output one or more eigenpairs for
each tensor. The three dimensional array corresponds to the set of voxels which discretize the
volume of a brain. The entries of each tensor correspond to the coefficients of the
homogeneous polynomial which approximates the diffusion function for a given voxel. The
eigenpairs which define local maxima of the approximate diffusion function correspond to
principal nerve fiber directions within the voxel.

In order to find multiple eigenpairs, Algorithm 1 must be executed with different starting
vectors. Because there is not much theory to direct the choice of starting vectors to find all
eigenpairs corresponding to local maxima, we use many randomly chosen starting vectors in
order to get reasonable coverage of the unit sphere. We choose random vectors by independently
selecting each vector entry uniformly from $[-1, 1]$ and then normalizing. Alternatively, one
could use a deterministic approach and pick starting vectors evenly spaced about the sphere.
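The random choice of starting vectors described above might look as follows in Python. This is a sketch with hypothetical names; the fixed seed and the retry on a zero vector are additions made here for reproducibility and safety, not details taken from the text.

```python
import math
import random

def random_unit_vector(n, rng):
    """Draw each entry uniformly from [-1, 1], then normalize to unit length."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:                 # guard against the (unlikely) zero vector
            return [c / norm for c in v]

rng = random.Random(0)                 # fixed seed, an assumption for repeatability
starts = [random_unit_vector(3, rng) for _ in range(128)]
print(len(starts), starts[0])
```

Note that normalizing points drawn from a cube does not give an exactly uniform distribution on the sphere; the goal stated in the text is only reasonable coverage.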

The computational problem consists of executing Algorithm 1 with many different tensors and
many different starting vectors each. Since the voxel size for DW-MRI is on the order of one
cubic millimeter, the number of voxels in a data set for a human brain can be in the millions.
In order to cover the sphere, we use somewhere between 32 and 128 starting vectors for each
tensor. With this much inherent parallelism in the problem, we can easily saturate the
computational units on a GPU. The main data structures involved in the computation include the
unique entries of each tensor, an array of randomly generated starting vectors, an array of
output eigenvectors, and an array of output eigenvalues.

6.1. Synthetic Test Set. We experimented with a synthetic test set provided by the Scientific
Computing and Imaging Institute at the University of Utah. It consists of 1024 tensors
corresponding to a 2D array of voxels which includes some with one and some with two principal
fiber directions. Each tensor is 4th order, so each has 81 total entries with 15 unique
values. We chose to use 128 starting vectors for each tensor in the hope of reasonably
covering the sphere in $\mathbb{R}^3$ and also because it is a multiple of 32, the physical
warp size on the GPU. We used a shift of $\alpha = 0$ as it yielded correct results for the
tensors in this synthetic set. Note that $\alpha = 0$ implies that SS-HOPM is the same
algorithm as the one given in [2, 4]. Although the performance of the implementation will not
vary much with $\alpha$, choosing an appropriate shift for real data will balance a tradeoff
between guarantees of convergence and time-to-completion. To find local maxima, a nonnegative
shift must be used.

6.2. Thread Organization. Because of the number of independent problems, we are able to map
the computation to the GPU in a straightforward way with minimal synchronization. We organize
the CUDA threads in the following way: assign a thread block to each tensor and assign each
thread in a thread block to a different starting vector. Since the number of starting vectors
is greater than the warp size, each thread block will utilize all the processors on its
multiprocessor. Similarly, as long as the number of tensors is at least 50 or so, all of the
multiprocessors will be utilized with three or four thread blocks each (multiple thread blocks
are necessary to fill the instruction pipelines).
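The mapping of tensors to thread blocks and starting vectors to threads can be mimicked on the host side in Python. The sketch below is illustrative only: the function names and the flat, column-major output layout are assumptions made here, not the layout used in our implementation.

```python
WARP_SIZE = 32

def launch_shape(T, V):
    """One thread block per tensor, one thread per starting vector."""
    return {
        "blocks": T,
        "threads_per_block": V,
        "warps_per_block": -(-V // WARP_SIZE),   # ceiling division
    }

def output_slot(block_idx, thread_idx, V, n):
    """Hypothetical layout: the eigenvalue slot and the eigenvector entries
    written by one thread, so that no two threads share an output location."""
    col = block_idx * V + thread_idx
    return col, range(col * n, (col + 1) * n)

print(launch_shape(1024, 128))
print(output_slot(2, 5, 128, 3))
```

Because every (block, thread) pair owns a distinct output slot, the threads need no synchronization when writing results.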

6.3. Data Structures. Because of the small size of the tensors and vectors in this problem,
we can fit all the data for each thread block in the on-chip memory and minimize the
accesses to device memory. Let $T$ be the number of tensors, $U$ be the number of unique
entries in each tensor, and $V$ be the number of starting vectors. Recall that for this problem,
$m = 4$, $n = 3$, $T = 1024$, $U = 15$, and $V = 128$. For real data, we expect $T$ to grow into
the millions but the rest of the parameters will remain constant, though $V$ could be varied
experimentally. The tensor data is of size $T \cdot U$, the array of starting vectors is $n \times V$, the array
of output eigenvectors is $n \times (T \cdot V)$, and the array of output eigenvalues is of size $T \cdot V$. Note
that every thread block can use the same set of starting vectors, but each has its own set of
output vectors.
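
The sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names and the flat output index `t * V + v` (one slot per thread block `t` and thread `v`) are assumptions made for illustration.

```python
from math import comb

def problem_sizes(m, n, T, V):
    """Array sizes (in words) for the data structures described in the text."""
    U = comb(m + n - 1, m)           # unique entries of a symmetric order-m, dimension-n tensor
    return {
        "total_entries": n ** m,     # entries of the full tensor, ignoring symmetry
        "U": U,
        "tensor_data": T * U,
        "starting_vectors": n * V,
        "eigenvectors": n * T * V,
        "eigenvalues": T * V,
    }

def output_slot(t, v, V):
    """Hypothetical flat index of the eigenvalue for tensor t and starting vector v."""
    return t * V + v

sizes = problem_sizes(m=4, n=3, T=1024, V=128)
```

With the parameters of this section the tensor data occupies 15360 words, far less than the 81 entries per tensor that a dense layout would need.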

In addition to the main data structures, we pre-compute and store the index and multinomial
coefficient information required in Algorithms 2 and 3. The index information is stored
as an array of size $m \times U$ and can be shared by all threads. We store the multinomial coefficient
$\binom{m}{k_1, \ldots, k_n}$ for each unique tensor value, where $[k_1, \ldots, k_n]$ is the monomial representation of the
index class of the unique entry. In this way, finding the number of occurrences of an entry
in Algorithm 2 is just a look-up, and computing the related multinomial coefficients used in
Algorithm 3, which are of the form $\binom{m-1}{k_1, \ldots, k_i - 1, \ldots, k_n}$ for some $i$, can be done by reading the stored
value, multiplying by $k_i$ and dividing by $m$.³ Thus the array of multinomial coefficients is of
size $U$. All threads can share this information.
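
The look-up trick relies on the identity $\binom{m-1}{k_1,\ldots,k_i-1,\ldots,k_n} = \binom{m}{k_1,\ldots,k_n}\, k_i / m$. The sketch below checks it for every index class when $m = 4$ and $n = 3$; the helper names are illustrative and do not come from the original implementation.

```python
from itertools import combinations_with_replacement
from math import factorial

def multinomial(ks):
    """Multinomial coefficient (k_1 + ... + k_n)! / (k_1! ... k_n!)."""
    out = factorial(sum(ks))
    for k in ks:
        out //= factorial(k)
    return out

def derived_coefficient(stored, ks, i):
    """Coefficient with k_i reduced by one, obtained from the stored value for ks."""
    m = sum(ks)
    return stored * ks[i] // m       # exact: the product is always divisible by m

def monomials(m, n):
    """Monomial representations [k_1, ..., k_n] of all index classes, in lexicographic order."""
    for idx in combinations_with_replacement(range(n), m):
        yield tuple(idx.count(d) for d in range(n))

stored_table = {ks: multinomial(ks) for ks in monomials(4, 3)}
```

The stored values also count how often each unique entry occurs in the full tensor, so they sum to $3^4 = 81$.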

6.4.  Memory  Management.  We  use  both  the  shared  memory  and  constant  cache  to 
minimize  the  memory  accesses  to  device  memory.  Because  the  index  array  and  multinomial 
coefficients  are  read  only  and  can  be  shared  by  all  the  threads  in  the  computation,  we  designate 
them  as  constant  memory  which  resides  in  global  (device)  memory.  However,  because  that 
information  can  fit  into  the  constant  cache  of  each  multiprocessor,  they  will  be  read  from 
device  memory  to  the  cache  only  once  per  multiprocessor  for  the  entire  computation.  Because 
the  tensor  entries  can  be  shared  by  the  threads  within  one  thread  block,  we  store  them  in  the 
shared  memory.  In  this  way,  the  tensor  entries  are  read  from  device  memory  to  the  on-chip 
shared  memory  only  once  per  thread  block. 


³ One might consider storing the “coefficient” so that only one multiply is needed to update the stored
value for each kernel, but note that this value is not an integer in general.

Fig. 6.1. Unrolled computation of the first entry of the vector $\mathcal{A}x^{m-1}$, for $\mathcal{A} \in \mathbb{R}^{[4,3]}$. The variables x1, x2, x3
are register variables which store the input vector and the Avals array is in shared memory.


Finally,  we  store  the  input  and  output  vectors,  which  are  private  to  each  thread,  in  shared 
memory.  Although  this  data  will  not  be  shared  with  other  threads  in  the  thread  block,  we  use 
the  shared  memory  because  it  is  the  only  on-chip  memory  that  can  be  dynamically  allocated 
and  overwritten.  There  are  two  main  drawbacks  from  using  shared  memory  this  way.  First, 
allocating $2n$ words of shared memory per thread requires a lot of memory per thread block,
and  since  the  physical  shared  memory  is  shared  by  all  thread  blocks  on  a  multiprocessor, 
fewer  thread  blocks  can  be  scheduled  simultaneously  on  each  multiprocessor.  The  amount 
of  oversubscription  (known  as  “occupancy”  in  nVidia’s  terminology)  allows  for  pipelining 
instruction  streams  and  hiding  memory  latency.  Second,  the  register  file  is  faster  to  access 
than  shared  memory.  Since  the  number  of  thread  blocks  per  multiprocessor  is  limited  by  the 
shared  memory  requirements,  the  size  of  the  register  file  is  not  being  exploited. 

6.5. Loop Unrolling. For a given order and dimension, we can unroll the loops within
the two main computational kernels. This enables us to exploit the register file for storing the
input and output vectors by statically allocating register variables corresponding to input and
output vector entries. Not only does this expose instruction-level parallelism to the compiler,
it also removes the indirection in accessing input and output vector entries. This is possible
for small problems, but to scale to larger problems we would need a blocked approach. See
Figure 6.1 for an example of an unrolled loop in the case $m = 4$ and $n = 3$. The Avals array
stores the unique tensor entries in lexicographic order, the input vector entries are stored in
static variables, and the multinomial coefficients are stored as constants in the instruction
stream. In this case, the number of terms in the summation for $\mathcal{A}x^m$ is 15, and each of the
three summations for the entries of the output vector $\mathcal{A}x^{m-1}$ has 10 terms.
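
To make the unrolled form concrete, here is a sketch in Python rather than the CUDA of the original kernel. It assumes the lexicographic ordering of index classes described above (1111, 1112, 1113, 1122, ...) and derives the constants from the multinomial formula; it is an illustration of the idea, not a reproduction of Figure 6.1. A brute-force reference over the full tensor is included as a check.

```python
from itertools import combinations_with_replacement, product

def unrolled_first_entry(Avals, x1, x2, x3):
    """First entry of A x^{m-1} for m = 4, n = 3, with no loops or index arrays."""
    return (      Avals[0] * x1 * x1 * x1    # a_1111
            + 3 * Avals[1] * x1 * x1 * x2    # a_1112
            + 3 * Avals[2] * x1 * x1 * x3    # a_1113
            + 3 * Avals[3] * x1 * x2 * x2    # a_1122
            + 6 * Avals[4] * x1 * x2 * x3    # a_1123
            + 3 * Avals[5] * x1 * x3 * x3    # a_1133
            +     Avals[6] * x2 * x2 * x2    # a_1222
            + 3 * Avals[7] * x2 * x2 * x3    # a_1223
            + 3 * Avals[8] * x2 * x3 * x3    # a_1233
            +     Avals[9] * x3 * x3 * x3)   # a_1333

def reference_first_entry(Avals, x):
    """Brute-force sum over all 27 index triples (i2, i3, i4) of the full symmetric tensor."""
    classes = list(combinations_with_replacement(range(3), 4))   # lexicographic index classes
    lookup = dict(zip(classes, Avals))
    total = 0.0
    for i2, i3, i4 in product(range(3), repeat=3):
        total += lookup[tuple(sorted((0, i2, i3, i4)))] * x[i2] * x[i3] * x[i4]
    return total

Avals = [float(i + 1) for i in range(15)]
x = (0.5, -1.0, 2.0)
unrolled = unrolled_first_entry(Avals, *x)
reference = reference_first_entry(Avals, x)
```

Only the first 10 of the 15 unique values appear, matching the term count stated in the text, and the ten constants sum to 27, the number of terms in the dense summation.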

Without unrolling the loops, each access to an input or output vector requires two memory
operations. For example, if the index information is stored in an array called index, then
accessing entries of the input vector x takes the form x[index[k]]. This indirection prevents
the compiler from pipelining instructions within one thread and degrades performance even
if index and x are both on-chip (see Section 6.6).

Note that the arithmetic intensity (ratio of flops to bytes involved in the computations) is
high for both kernels. In the case $m = 4$ and $n = 3$, there are 15 unique tensor entries and
two vectors each with three entries, so the number of bytes in single precision is 84 while
the number of flops in computing $\mathcal{A}x^{m-1}$ is 140. Another possible optimization would be to
use common subexpression elimination on the unrolled summations. For example, the code
shown in Figure 6.1 computes x2 three times.
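
The byte count and the resulting intensity follow from simple arithmetic, sketched here. The flop count of 140 is taken from the text as given and is not re-derived; the variable names are illustrative.

```python
BYTES_PER_WORD = 4                      # single precision
unique_entries, n = 15, 3               # m = 4, n = 3
words = unique_entries + 2 * n          # tensor values plus the input and output vectors
bytes_moved = words * BYTES_PER_WORD
flops = 140                             # count quoted in the text for A x^{m-1}
intensity = flops / bytes_moved         # flops per byte
```

This gives 84 bytes and an intensity of about 1.67 flops per byte.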

6.6. Results. The processor used for these results is a quad-core Intel Bloomfield (Core
i7). The GPU used is an nVidia GeForce 9800 GT which nVidia classifies as Compute Capability
1.1. The parallel CPU code was run with four threads using OpenMP. All computations
were done in single precision (the only precision available on GPUs of Compute Capability
1.1), and we use 128 starting vectors in all cases.

(a) Flop rates in Gflop/s and speedup of loop unrolling

             General   Unrolled   Unrolled speedup
  CPU seq       0.24       1.86               7.86
  CPU par       0.92       6.85               7.41
  GPU           5.95     131.73              22.15

(b) Relative performance, normalized to sequential implementations

             General   Unrolled
  CPU seq       1.00       1.00
  CPU par       3.90       3.67
  GPU          25.08      70.66

Table 6.1. Performance results for six different implementations on all 1024 tensors.

We  report  on  six  different  implementations.  We  benchmarked  a  completely  sequential 
implementation,  using  one  core  on  the  quad-core  CPU;  a  parallel  CPU  implementation,  using 
all  four  cores  of  the  processor;  and  our  GPU  implementation.  In  each  case,  we  benchmarked 
both  the  general  version  of  the  code  and  the  loop-unrolled  version  which  is  specialized  to 
tensors  of  order  4  and  dimension  3.  Note  that  no  memory  hierarchy  optimizations  were  used 
in  the  CPU  implementations. 

Table 6.1 shows the performance results for all six implementations computing the eigenpairs
for all 1024 tensors. Table 6.1(a) shows the absolute performance and gives the speedup
observed for each implementation by unrolling the loops. Comparing the unrolled code to
the general implementations, we see that unrolling yields over 7x speedup for both CPU
implementations and a 22x speedup for the GPU implementation. In Table 6.1(b) the relative
performance values are normalized to the sequential CPU implementations to show parallel
speedups. We observe that the unrolled GPU implementation achieves a speedup of roughly 19x
over the unrolled parallel CPU implementation (131.73 versus 6.85 Gflop/s). Although the CPU
implementation was not optimized for the
memory hierarchy, we believe that because of the large number of independent problems in
this application, the GPU implementation will outperform the best multi-core implementation
for this test set. Future research will explore which architecture is better suited for computing
eigenpairs of larger tensors or tensor applications with less inherent parallelism. In either
case, developing high performing code for general orders and dimensions will require an
efficient blocking strategy to allow for loop unrolling and the use of register variables.
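
The ratios discussed in this section can be recomputed from the flop rates in Table 6.1(a), as in the sketch below. Because the tabulated rates are rounded to two decimals, the recomputed ratios can differ slightly from the published figures; the dictionary layout is an assumption made for the example.

```python
gflops = {
    "CPU seq": {"general": 0.24, "unrolled": 1.86},
    "CPU par": {"general": 0.92, "unrolled": 6.85},
    "GPU":     {"general": 5.95, "unrolled": 131.73},
}

def speedup(fast, slow):
    """Ratio of two flop rates."""
    return fast / slow

gpu_unrolling_gain = speedup(gflops["GPU"]["unrolled"], gflops["GPU"]["general"])
gpu_vs_cpu_seq = speedup(gflops["GPU"]["unrolled"], gflops["CPU seq"]["unrolled"])
gpu_vs_cpu_par = speedup(gflops["GPU"]["unrolled"], gflops["CPU par"]["unrolled"])
```

From the rounded rates, the unrolled GPU code is about 22 times faster than the general GPU code, about 71 times faster than the unrolled sequential CPU code, and about 19 times faster than the unrolled parallel CPU code.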

Figure  6.2  shows  performance  results  for  four  different  implementations  for  subsets  of 
the  1024  tensors  in  our  test  set.  Note  that  the  loop  unrolling  makes  a  significant  difference 
in  the  GPU  performance  for  all  problem  sizes.  Because  of  the  independence  of  the  tensor 
eigenproblems,  the  parallel  CPU  implementation  requires  only  a  slight  modification  of  the 
sequential  code  using  OpenMP  pragmas  and  we  observe  close  to  perfect  parallel  scaling  for 
sufficiently  large  problems. 

Fig. 6.2. Performance results for running SS-HOPM on sets of 4th order 3-dimensional tensors with 128
starting vectors each. Note the y-axis is a log scale.

7. Conclusions. In this paper we present an implementation of SS-HOPM targeted for a
GPU. We describe how to save both storage and computation in the two main computational
kernels of the algorithm, and for the case of solving many small tensor eigenproblems we
show how to map the computation onto a GPU. For our experimental data set, we achieved
parallel speedups of up to 70x over a sequential code using the same low-level optimizations
(but no memory hierarchy optimizations).

We believe that the techniques for exploiting symmetry may be extended to other computations
involving symmetric tensors, but many open questions remain about how to write
sequential or parallel implementations of the computational kernels that scale to higher order
and higher dimension tensors. We are also interested in how to map these computations
onto different computing platforms, including more recent GPUs which offer fundamentally
different hardware features.

Uncertainty Quantification and Sensitivity Analysis

Uncertainty quantification and sensitivity analysis attempt to quantify the effect that variation
in model parameters has on a physical model. The algorithms for computing these effects
can be computationally intensive and require the development of novel numerical methods for
efficient solution. Even given an efficient algorithm, the development and application of appropriate
methodologies for sensitivity analysis is still an area of active research. The articles
in this section touch on both efficient solution methods and methodological application and
development.

Tipireddy et al. compare a number of preconditioners for stochastic Galerkin methods.
The performance of these methods is compared against results for non-intrusive stochastic
collocation methods. Miller et al. apply the stochastic collocation method to the drift-diffusion
equations for semiconductor modeling. Using the results of the collocation method, a sensitivity
analysis is performed, giving insight into the global sensitivity of the response function
to the parameters. Ahuja et al. develop Krylov recycling methods that are appropriate for rapidly
converging linear systems. The performance of these methods is demonstrated on both ice
modeling problems and embedded uncertainty quantification methods. Blass and Romero
develop a method to analyze the stability of a stochastically forced ordinary differential equation.
The approach utilizes analytic expressions for the eigenvalues and eigenfunctions of a second
order differential operator to determine the stability. Proctor et al. consider a sensitivity analysis
for a linear neutronics model for nuclear reactors. This work compares the performance
of adjoint-based local and global sensitivity analysis.

A COMPARISON OF SOLUTION METHODS FOR STOCHASTIC PARTIAL
DIFFERENTIAL EQUATIONS

RAMAKRISHNA TIPIREDDY§, ERIC T. PHIPPS¶, AND ROGER G. GHANEM‖

Abstract. Several solution methods for stochastic Galerkin discretization of partial differential equations (PDEs)
with random input data are compared. Less intrusive approaches based on Jacobi and Gauss-Seidel mean iterations
are compared with more intrusive Krylov-based approaches. A set of preconditioners for the Krylov-based
iterative methods to accelerate convergence is also examined, including mean-based, Gauss-Seidel, approximate
Gauss-Seidel and approximate Jacobi mean preconditioners. All of these methods are compared to a non-intrusive
stochastic collocation approach applied to a canonical stochastic diffusion problem. For this problem, the Krylov-based
approach using approximate Gauss-Seidel and Jacobi preconditioners is found to be most effective. Sandia’s
Trilinos software is used to implement all the above algorithms.

1.  Introduction.  Real  life  physical  problems  are  often  modeled  as  partial  differential 
equations  (PDEs)  where  input  data  are  treated  as  random  to  represent  uncertainty  in  this  data. 
Monte  Carlo  techniques  are  popular  methods  to  solve  these  problems  as  they  only  require 
solutions  to  the  PDE  for  a  given  set  of  realizations  of  the  input  data.  More  recently  however, 
the  stochastic  finite  element  method  [6,  3]  has  become  a  popular  choice  for  solving  these 
problems  because  of  its  advantages  over  Monte  Carlo  methods.  These  methods  compute 
statistical  properties  of  the  solution  more  efficiently  than  Monte  Carlo  methods. 

Stochastic finite element methods are either intrusive stochastic Galerkin methods (SGMs) ([13,
11, 4, 8]) or non-intrusive stochastic collocation methods ([15, 18, 10, 2]). Both exploit solution
regularity to achieve higher convergence rates than Monte Carlo methods. The first
approach translates the stochastic PDE into a coupled set of deterministic PDEs while the
second samples the stochastic PDE at a predetermined set of collocation points, resulting in
a set of uncoupled deterministic PDEs. The solution at these collocation points is then used
to interpolate the solution in the entire random input domain. Extending legacy software to
support stochastic collocation methods is simpler than supporting SGMs. Moreover, intrusive
SGMs require specialized linear solvers. However, the resulting set of PDEs in the stochastic
Galerkin system is much smaller in number than that in the collocation method. For a canonical
random diffusion problem, it is shown in [9] that the SGM using iterative Krylov-based linear
solvers and mean-based preconditioning [12] is more efficient than the non-intrusive sparse
grid collocation method.

While  the  stochastic  Galerkin  method  is  often  considered  to  be  a  fully  intrusive  method, 
there  are  in  fact  a  variety  of  solver  approaches  for  the  stochastic  Galerkin  method  that  are 
less  intrusive.  In  this  work,  less  intrusive  Gauss-Seidel  and  Jacobi  mean  solver  methods  are 
compared  to  more  intrusive  Krylov-based  techniques.  We  consider  these  methods  to  be  less 
intrusive  than  the  Krylov-based  methods  as  they  allow  reuse  of  existing  deterministic  solvers. 
Moreover  preconditioning  techniques  for  Krylov-based  methods  based  on  Gauss-Seidel  and 
Jacobi  ideas  are  also  explored  and  compared  to  traditional  mean-based  preconditioning.  All 
of these techniques are then compared to the non-intrusive stochastic collocation method applied
to a canonical random diffusion problem. These comparisons demonstrate a trade-off in
computational  cost  versus  intrusiveness  with  the  Krylov-based  methods  using  an  approximate 
Gauss-Seidel  or  Jacobi  mean  preconditioner  being  the  most  efficient. 

This paper is organized as follows. In section 2, the model random diffusion problem is
formulated. Two models of the input random field are developed in section 3 which dictate
very different behavior for the stochastic solution methods considered next. Section 4 describes
the stochastic Galerkin method, and various solver and preconditioning methods are
introduced. The sparse grid collocation method is then reviewed in section 5. In section 6,
numerical experiments are carried out to compare the efficiency of the various solver and
preconditioning methods that have been introduced. Finally, section 7 provides the concluding
remarks.

2. Problem Statement. In this work a stochastic steady state elliptic diffusion equation
with zero Dirichlet boundary conditions [9] is used as a test problem for various stochastic
PDE solution methods. Let $D$ be an open subset of $\mathbb{R}^n$ (for this work we assume $n = 2$) and
$(\Omega, \Sigma, P)$ be a complete probability space with sample space $\Omega$, $\sigma$-algebra $\Sigma$ and probability
measure $P$. Assume $a(x, \omega) : D \times \Omega \to \mathbb{R}$ is a random field that is bounded and strictly
positive, that is,

$$0 < a_l < a(x, \omega) < a_u < \infty \quad \text{a.e. in } D \times \Omega. \tag{2.1}$$

We wish to compute a random field $u(x, \omega) : D \times \Omega \to \mathbb{R}$, $u \in H^1(D) \otimes L_2(\Omega)$, such that the
following holds $P$-almost surely ($P$-a.s.):

$$-\nabla \cdot (a(x, \omega) \nabla u(x, \omega)) = f(x, \omega) \quad \text{in } D \times \Omega, \tag{2.2}$$

$$u(x, \omega) = 0 \quad \text{on } \partial D \times \Omega. \tag{2.3}$$

Let $H_0^1(D)$ be the subspace of functions in the Sobolev space $H^1(D)$ that vanish on the boundary $\partial D$,
equipped with the norm $\|u\|_{H_0^1(D)} = \left[\int_D |\nabla u|^2 \, dx\right]^{1/2}$. Problem 2.2 can then be written in the
following equivalent variational form [7]: find $u \in H_0^1(D) \otimes L_2(\Omega)$ such that

$$b(u, v) = l(v), \quad \forall v \in H_0^1(D) \otimes L_2(\Omega), \tag{2.4}$$

where $b(u, v)$ is the continuous and coercive (from assumption 2.1) bilinear form given by

$$b(u, v) = E\left[\int_D a \nabla u \cdot \nabla v \, dx\right], \quad \forall u, v \in H_0^1(D) \otimes L_2(\Omega), \tag{2.5}$$

and $l(v)$ is the continuous bounded linear functional given by

$$l(v) = E\left[\int_D f v \, dx\right], \quad \forall v \in H_0^1(D) \otimes L_2(\Omega). \tag{2.6}$$

Here $E[\cdot]$ denotes mathematical expectation. From the Lax-Milgram lemma, Eq. 2.4 has
a unique solution in $H_0^1(D) \otimes L_2(\Omega)$.

3. Input random field model. For computational purposes, the diffusion coefficient
$a(x, \omega)$ must be discretized in both the spatial and stochastic domains. To this end, it is often
approximated with a truncated series expansion that separates the spatial variable $x$ from the
stochastic variable $\omega$, resulting in a representation by a finite number of random variables. For
this representation, second order information such as the covariance function of the random
field is required. In the present problem, two cases of random field models are considered.
In the first case, the random field is assumed to be uniformly distributed and is approximated
through a truncated Karhunen-Loève expansion. In the second case, the random field is
assumed to have a log-normal distribution, that is $a(x, \omega) = \exp(g(x, \omega))$ where $g(x, \omega)$ is a
Gaussian random field, and is approximated by a truncated polynomial chaos expansion.

3.1. Karhunen-Loève expansion. Let $C(x_1, x_2) = E[a(x_1, \omega) a(x_2, \omega)]$ be the covariance
function of the random field $a(x, \omega)$. Then $a$ can be approximated through its truncated
Karhunen-Loève (K-L) expansion [6] given by

$$a(x, \omega) \approx a(x, \xi(\omega)) = a_0(x) + \sum_{i=1}^{M} \sqrt{\lambda_i}\, a_i(x)\, \xi_i(\omega), \tag{3.1}$$

where $a_0(x)$ is the mean of the random field $a(x, \omega)$ and $\{(\lambda_i, a_i(x))\}_{i \ge 1}$ are solutions of the
integral eigenvalue problem

$$\int_D C(x_1, x_2)\, a_i(x_2)\, dx_2 = \lambda_i a_i(x_1). \tag{3.2}$$

The eigenvalues $\lambda_i$ are positive and non-increasing, and the eigenfunctions $a_i(x)$ are orthonormal,
that is,

$$\int_D a_i(x)\, a_j(x)\, dx = \delta_{ij}, \tag{3.3}$$

where $\delta_{ij}$ is the Kronecker delta. In Eq. 3.1, $\{\xi_i\}_{i=1}^{M}$ are uncorrelated random variables with
zero mean. As a first test-case, the diffusion coefficient $a(x, \omega)$ is modeled with an exponential
covariance function

$$C(x_1, x_2) = \sigma^2 \exp(-\|x_1 - x_2\|_1 / L) \tag{3.4}$$

and uniformly distributed random variables $\xi_i(\omega)$. We further assume the random variables
are independent.
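
As a small numerical illustration of the eigenvalue problem (3.2), the sketch below computes the dominant eigenpair of the exponential covariance (3.4) by power iteration. It makes several simplifying assumptions that are not in the paper: a one-dimensional domain $D = [0, 1]$ instead of $n = 2$, a midpoint-rule discretization of the integral, $\sigma = L = 1$, and only the largest eigenpair.

```python
from math import exp, sqrt

def dominant_kl_pair(n=40, sigma=1.0, L=1.0, iters=200):
    """Largest eigenpair of the exponential covariance kernel on [0, 1] (midpoint rule)."""
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    C = [[sigma ** 2 * exp(-abs(xi - xj) / L) for xj in xs] for xi in xs]
    a = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        # Discretized integral operator: (C a)(x_i) ~ h * sum_j C(x_i, x_j) a(x_j)
        w = [h * sum(cij * aj for cij, aj in zip(row, a)) for row in C]
        lam = sqrt(h * sum(wi * wi for wi in w))   # discrete L2(D) norm, tends to lambda_1
        a = [wi / lam for wi in w]                 # keep the eigenfunction normalized as in (3.3)
    return lam, a, C, h

lam1, a1, C, h = dominant_kl_pair()
```

The returned vector approximates $a_1(x)$ at the grid midpoints and satisfies the discrete form of the normalization (3.3); further eigenpairs would require deflation or a full symmetric eigensolver.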

3.2. Polynomial chaos expansion. The K-L expansion above approximates a random
field by a linear combination of a finite set of random variables. To maintain positivity
of the random field, such a representation is only appropriate if the random variables are
bounded [16]. For unbounded random variables (e.g., log-normal) a nonlinear polynomial
chaos representation is more appropriate. The polynomial chaos expansion [6, 17] is used
to approximate a random field in terms of multi-variate orthogonal polynomials. Let
$\xi = (\xi_1, \ldots, \xi_M)^T$ be the random variables from a truncated K-L expansion of a given random
field $g(x, \omega)$, that is

$$g(x, \omega) \approx g(x, \xi(\omega)) = g_0(x) + \sum_{i=1}^{M} \sqrt{\lambda_i}\, g_i(x)\, \xi_i(\omega). \tag{3.5}$$

Assume $a(x, \omega)$ is then given by a nonlinear transformation of $g(x, \omega)$. Then $a(x, \omega)$ can be
represented through nonlinear functionals of the random variables $\xi_i(\omega)$. It has been shown
in [17, 6] that this functional dependence can be expanded in terms of multi-dimensional
orthogonal polynomials, called polynomial chaos, as

$$a(x, \omega) = a_0(x) + \sum_{i_1=1}^{\infty} a_{i_1}(x)\, \Gamma_1(\xi_{i_1}(\omega)) + \sum_{i_1=1}^{\infty} \sum_{i_2=1}^{i_1} a_{i_1 i_2}(x)\, \Gamma_2(\xi_{i_1}(\omega), \xi_{i_2}(\omega)) + \cdots \tag{3.6}$$

where $\Gamma_n(\xi_{i_1}, \ldots, \xi_{i_n})$ is the multi-dimensional polynomial chaos of order $n$ in random variables
$(\xi_{i_1}, \ldots, \xi_{i_n})$. A one-to-one mapping of polynomials $\{\Gamma_i\}$ to a set of polynomials with
ordered indices $\{\psi_i(\xi)\}$ can be introduced [6]. After substituting $\{\psi_i\}$ in Eq. 3.6 and truncating
the series to a finite number of terms $N_\xi$, the random field $a(x, \omega)$ can thus be approximated as

$$a(x, \omega) \approx a(x, \xi(\omega)) = a_0(x) + \sum_{i=1}^{N_\xi} a_i(x)\, \psi_i(\xi). \tag{3.7}$$

The polynomials $\{\psi_i(\xi)\}$ are orthogonal with respect to the inner product defined by expectation
in the stochastic space,

$$\int_\Omega \psi_i(\xi(\omega))\, \psi_j(\xi(\omega))\, dP(\omega) = \delta_{ij}. \tag{3.8}$$

As a second test-case, the diffusion coefficient $a(x, \omega)$ is modeled as a log-normal random
field [5] where $a(x, \omega) = \exp(g(x, \omega))$ and $g(x, \omega)$ is a Gaussian random field with exponential
covariance (3.4), approximated with a truncated K-L expansion (3.5). In this case
the random variables $\xi_i$ are standard normal random variables and thus are independent. It
can also be shown that the polynomials $\{\psi_i\}$ are tensor products of one-dimensional Hermite
polynomials. For a given total polynomial order $p$, the total number of polynomials $\{\psi_i(\xi)\}$ is

$$N_\xi + 1 = \frac{(M + p)!}{M!\, p!}.$$
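
The basis size grows quickly with both $M$ and $p$. The sketch below evaluates the formula and checks it against a brute-force count of multi-indices of total degree at most $p$; the function names are illustrative.

```python
from itertools import product
from math import factorial

def num_pc_terms(M, p):
    """N_xi + 1 = (M + p)! / (M! p!): polynomials of total order <= p in M variables."""
    return factorial(M + p) // (factorial(M) * factorial(p))

def count_multi_indices(M, p):
    """Brute-force count of multi-indices (d_1, ..., d_M) with d_1 + ... + d_M <= p."""
    return sum(1 for d in product(range(p + 1), repeat=M) if sum(d) <= p)

basis_size = num_pc_terms(4, 3)
```

For example, $M = 4$ random variables and total order $p = 3$ give 35 polynomials, so the stochastic Galerkin system is 35 times the size of the deterministic one.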


4. Stochastic Galerkin method. In the stochastic Galerkin method, we seek the solution
of the variational problem 2.4 in a tensor product space $X_h \otimes Y_p$, where $X_h \subset H_0^1(D)$ is a
finite dimensional space of continuous polynomials corresponding to the spatial discretization
of $D$ and $Y_p \subset L_2(\Omega)$ is the space of random variables spanned by polynomial chaos [6]
of order up to $p$. Then the finite dimensional approximation $u_{X_h Y_p}(x, \omega)$ of the exact solution
$u(x, \omega)$ on the tensor product space $X_h \otimes Y_p$ is given as the solution to

$$b(u_{X_h Y_p}, v) = l(v) \quad \forall v \in X_h \otimes Y_p. \tag{4.1}$$

In Eq. 4.1 the random field $a(x, \omega)$ in the bilinear form $b(u_{X_h Y_p}, v)$ can be approximated
using either a K-L expansion or a polynomial chaos expansion, depending on whether the
random field depends linearly or nonlinearly on the input random variables. The resulting set
of coupled PDEs is then discretized using standard techniques such as the finite element or
finite difference methods. In the former case, a trial function $u_{X_h Y_p}$ can be written as

$$u_{X_h Y_p}(x, \xi) = \sum_{i,j} u_{ij}\, N_i(x)\, \psi_j(\xi), \tag{4.2}$$

where $\{N_i(x)\}$ and $\{\psi_j(\xi)\}$ are the finite element shape functions and polynomial chaos polynomials
respectively. Substituting the trial function $u_{X_h Y_p}(x, \xi)$ and the test function $v(x, \xi) =
N_k(x)\psi_l(\xi)$ in Eq. 4.1, the discretized equations can be written as

$$\sum_{j=0}^{N_\xi} \sum_{i=0}^{P} c_{ijk} K_i u_j = f_k, \quad k = 0, \ldots, N_\xi, \tag{4.3}$$

where $f_k = E\{f(x, \xi)\psi_k\}$, and $c_{ijk} = E\{\xi_i \psi_j \psi_k\}$ and $P = M$ when $a$ is approximated by a
truncated K-L expansion, or $c_{ijk} = E\{\psi_i \psi_j \psi_k\}$ and $P = N_\xi$ when $a$ is approximated by a
polynomial chaos expansion. Here $\{K_i \in \mathbb{R}^{N_x \times N_x}\}_{i=0}^{P}$ are the polynomial chaos coefficients of
the stiffness matrix (section (3.4) of [12])

$$(K_i)_{lm} = \int_D a_i(x)\, \nabla N_l(x) \cdot \nabla N_m(x)\, dx, \quad i = 0, \ldots, P, \quad l, m = 1, \ldots, N_x, \tag{4.4}$$

and $\{u_j \in \mathbb{R}^{N_x}\}_{j=0}^{N_\xi}$ are the polynomial chaos coefficients of the discrete finite-element solution
vector

$$u_j = [u_{1j}, \ldots, u_{N_x j}]^T, \quad j = 0, \ldots, N_\xi. \tag{4.5}$$

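$\{K_i\}_{i=0}^{1}$ and $\{K_i\}_{i=2}^{P}$ are symmetric positive definite and symmetric indefinite matrices respectively.
Equation 4.3 can be written in the form of a global stochastic stiffness matrix of size
$((N_\xi + 1) \times N_x)$ by $((N_\xi + 1) \times N_x)$ as

$$\begin{bmatrix}
K^{0,0} & K^{0,1} & \cdots & K^{0,N_\xi} \\
K^{1,0} & K^{1,1} & \cdots & K^{1,N_\xi} \\
\vdots & \vdots & \ddots & \vdots \\
K^{N_\xi,0} & K^{N_\xi,1} & \cdots & K^{N_\xi,N_\xi}
\end{bmatrix}
\begin{Bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{N_\xi} \end{Bmatrix}
=
\begin{Bmatrix} f_0 \\ f_1 \\ \vdots \\ f_{N_\xi} \end{Bmatrix} \tag{4.6}$$

where $K^{j,k} = \sum_{i=0}^{P} c_{ijk} K_i$. We will denote this system as $Ku = f$. In practice it is prohibitive
to assemble and store the global stochastic stiffness matrix in this form; rather, each block of
the stochastic stiffness matrix can be computed from the $\{K_i\}$ when needed.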

4.1. Solution methods for stochastic Galerkin systems. In this section, various solver
techniques and preconditioning methods for solving the linear algebraic equations arising
from stochastic Galerkin discretizations (4.3) are described. The solver methods discussed
are: a Jacobi mean method, a Gauss-Seidel mean method, and Krylov-based iterative methods
[14]. Also various stochastic preconditioners used to accelerate convergence of the
Krylov methods are discussed, including mean-based [12], Gauss-Seidel mean, approximate
Gauss-Seidel mean and approximate Jacobi mean preconditioners. In Jacobi and Gauss-Seidel
methods, mean splitting is used rather than traditional diagonal block splitting as it
allows use of the same mean matrix $K_0$ for all inner deterministic solves (and thus reuse of
the preconditioner $P_0 \approx K_0$).

Jacobi mean algorithm. In this method, systems of equations of size equal to that of
the deterministic system are solved iteratively by updating the right-hand side to obtain the
solution to the stochastic Galerkin system of equations (4.3):

$$c_{0kk} K_0 u_k^{new} = f_k - \sum_{j=0}^{N_\xi} \sum_{i=1}^{P} c_{ijk} K_i u_j^{old}, \quad k = 0, \ldots, N_\xi. \tag{4.7}$$

The above system of equations is solved for $k = 0, \ldots, N_\xi$ using any solution technique
appropriate for the mean matrix $K_0$. Thus existing legacy software can be used with minimal
modification to solve the stochastic Galerkin system. In this work, Krylov-based iterative
methods with appropriate preconditioners will be used. One cycle of solves from
$k = 0, \ldots, N_\xi$ is considered one Jacobi outer iteration, and after each outer iteration, the right-hand
side in Eq. 4.7 is updated, replacing $\{u_j^{old}\}$ with the new solution $\{u_j^{new}\}$. These outer iterations
are continued until the required convergence tolerance is achieved. Note that for a given
outer iteration, all of the right-hand sides for $k = 0, \ldots, N_\xi$ are available simultaneously, and
thus their solution can be efficiently parallelized. Moreover, block algorithms optimized for
multiple right-hand sides may be used to further increase performance. Finally, this approach
does not require a large amount of memory to compute the solution. The disadvantage of the
method is that it may not converge, or may converge only very slowly.
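
The outer iteration (4.7) can be sketched on a toy problem. Everything concrete below is an assumption chosen for illustration and is not taken from the paper: a single uniform random variable on $[-1, 1]$ with orthonormal Legendre polynomials (so that $c_{0jk} = \delta_{jk}$ and $c_{1jk}$ is tridiagonal), made-up $2 \times 2$ matrices $K_0$ and $K_1$, and a direct $2 \times 2$ solve standing in for the preconditioned Krylov solve with $K_0$.

```python
from math import sqrt

def c(i, j, k):
    """c_ijk = E[xi_i psi_j psi_k], xi_0 = 1, xi_1 uniform on [-1, 1], orthonormal Legendre psi."""
    if i == 0:
        return 1.0 if j == k else 0.0
    if abs(j - k) != 1:
        return 0.0
    n = max(j, k)
    return n / sqrt(4.0 * n * n - 1.0)

def matvec(K, u):
    return [sum(kij * uj for kij, uj in zip(row, u)) for row in K]

def solve2(K, b):
    """Direct 2x2 solve (Cramer's rule); a stand-in for the deterministic mean solver."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return [(b[0] * K[1][1] - K[0][1] * b[1]) / det,
            (K[0][0] * b[1] - b[0] * K[1][0]) / det]

def jacobi_mean(Ks, f, tol=1e-12, max_outer=200):
    """Outer iteration (4.7): every right-hand side uses only the previous iterate."""
    u = [[0.0, 0.0] for _ in f]
    for outer in range(1, max_outer + 1):
        new = []
        for k in range(len(f)):
            rhs = list(f[k])
            for j in range(len(f)):
                for i in range(1, len(Ks)):
                    cijk = c(i, j, k)
                    if cijk != 0.0:
                        rhs = [r - cijk * v for r, v in zip(rhs, matvec(Ks[i], u[j]))]
            new.append([s / c(0, k, k) for s in solve2(Ks[0], rhs)])
        change = max(abs(a - b) for un, uo in zip(new, u) for a, b in zip(un, uo))
        u = new
        if change < tol:
            return u, outer
    return u, max_outer

def residual_norm(Ks, f, u):
    """Max-norm residual of sum_j sum_i c_ijk K_i u_j = f_k over all blocks k."""
    worst = 0.0
    for k in range(len(u)):
        r = list(f[k])
        for j in range(len(u)):
            for i, K in enumerate(Ks):
                cijk = c(i, j, k)
                if cijk != 0.0:
                    r = [a - cijk * b for a, b in zip(r, matvec(K, u[j]))]
        worst = max(worst, max(abs(a) for a in r))
    return worst

K0 = [[2.0, -1.0], [-1.0, 2.0]]          # made-up symmetric positive definite mean matrix
K1 = [[0.5, -0.2], [-0.2, 0.4]]          # made-up fluctuation matrix
f = [[1.0, 1.0]] + [[0.0, 0.0] for _ in range(3)]
u, outer_iterations = jacobi_mean([K0, K1], f)
```

Because each block solve in one sweep depends only on the previous iterate, the loop over `k` could run in parallel, which is the property noted in the text.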
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Gauss-Seidel  mean  iterative  method.  The  Gauss-Seidel  method  considered  is  similar 
to  the  the  Jacobi  method  above,  except  the  right-hand- side  in  Eq.  4.7  is  updated  after  each 
deterministic  solve  with  the  newly  computed  unkew .  Symbolically  this  is  written 

k- 1  P  p 

ckk0K0unkew  =  A- Yj  Z  ciJkK‘u7W  -  Z  Z  cHkKiU°ld,  k  =  0,---,Ng.  (4.8) 

7=0  i=  1  j=k  tm  1 

As before, one cycle of solves for $k = 0, \dots, N_\xi$ is considered one outer iteration of the Gauss-Seidel method, and these outer iterations are repeated until the required convergence tolerance is achieved. Often this method converges in fewer iterations than the Jacobi method, at the expense of no longer having all of the right-hand sides available simultaneously. This requires either recomputing the needed matrix-vector products $K_i u_j^{\mathrm{new}}$ for each $k$, which adds computational cost, or storing them as they are computed, which adds memory requirements. In both the Jacobi and Gauss-Seidel methods, the left-hand-side matrix is the mean matrix for all inner deterministic problems and only the right-hand side changes. In such cases recycled Krylov basis methods can be explored to increase performance.
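The difference between the two outer iterations can be sketched on a toy problem. The following is a minimal illustration, not the implementation used in this paper: it assumes a single Gaussian random variable with an orthonormal Hermite basis (so that $c_{0jk} = \delta_{jk}$ and $c_{1jk}$ is tridiagonal), a 2 x 2 "spatial" system, and hypothetical helper names (`hermite_c`, `coupling`, `mean_iteration`). A dense elimination routine stands in for the inner mean solve.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for small list-of-lists matrices."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def solve(A, b):
    """Gaussian elimination with partial pivoting; stands in for a mean solve."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            fac = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= fac * M[col][cc]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][cc] * x[cc] for cc in range(r + 1, n))) / M[r][r]
    return x

def hermite_c(n_basis):
    """c[i][j][k] = E[psi_i psi_j psi_k], i in {0, 1}, orthonormal 1-D Hermite basis."""
    c0 = [[1.0 if j == k else 0.0 for k in range(n_basis)] for j in range(n_basis)]
    c1 = [[0.0] * n_basis for _ in range(n_basis)]
    for k in range(n_basis - 1):
        c1[k][k + 1] = c1[k + 1][k] = math.sqrt(k + 1.0)
    return [c0, c1]

def coupling(c, K, u, k):
    """sum_j sum_{i>=1} c[i][j][k] K[i] u[j]: the non-mean terms of block row k."""
    total = [0.0] * len(u[0])
    for j in range(len(u)):
        for i in range(1, len(K)):
            if c[i][j][k] != 0.0:
                Ku = matvec(K[i], u[j])
                total = [t + c[i][j][k] * v for t, v in zip(total, Ku)]
    return total

def mean_iteration(c, K, f, gauss_seidel, tol=1e-10, max_outer=500):
    """Outer Jacobi or Gauss-Seidel iteration with the mean block on the left."""
    nb = len(f)
    u = [[0.0] * len(f[0]) for _ in range(nb)]
    f_norm = math.sqrt(sum(v * v for fk in f for v in fk))
    for outer in range(1, max_outer + 1):
        # Gauss-Seidel reads blocks as they are updated; Jacobi reads a frozen copy.
        src = u if gauss_seidel else [uk[:] for uk in u]
        for k in range(nb):
            rhs = [fv - cv for fv, cv in zip(f[k], coupling(c, K, src, k))]
            mean_block = [[c[0][k][k] * a for a in row] for row in K[0]]
            u[k] = solve(mean_block, rhs)
        # Residual of the full system (uses c[0][j][k] = 0 for j != k).
        res2 = 0.0
        for k in range(nb):
            K0u = matvec(K[0], u[k])
            ck = coupling(c, K, u, k)
            res2 += sum((fv - c[0][k][k] * a - b) ** 2
                        for fv, a, b in zip(f[k], K0u, ck))
        if math.sqrt(res2) <= tol * f_norm:
            return u, outer
    return u, max_outer

# Toy data: K_0 is a small SPD matrix, K_1 a small perturbation (illustrative values).
K = [[[2.0, -1.0], [-1.0, 2.0]], [[0.2, 0.0], [0.0, 0.1]]]
c = hermite_c(4)
f = [[1.0, 1.0]] + [[0.0, 0.0] for _ in range(3)]
u_gs, it_gs = mean_iteration(c, K, f, gauss_seidel=True)
u_jac, it_jac = mean_iteration(c, K, f, gauss_seidel=False)
print("outer iterations: Gauss-Seidel", it_gs, "Jacobi", it_jac)
```

In this sketch the only difference between the two methods is whether the coupling terms read the blocks updated earlier in the same sweep. That is also why the Jacobi right-hand sides can all be formed up front while the Gauss-Seidel ones cannot.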

Krylov based iterative methods with matrix-free operations. Krylov based iterative methods [14] such as the conjugate gradient (CG) method and the generalized minimal residual (GMRES) method can be used to solve the stochastic Galerkin system (4.3), in which matrix-vector products $v = Ku$ are computed using "matrix-free" operations:

$$v_k = \sum_{j=0}^{N_\xi} \sum_{i=0}^{P} c_{ijk} K_i u_j, \quad k = 0, \dots, N_\xi. \tag{4.9}$$

If the matrix-vector products are computed from Eq. 4.9, it is not necessary to assemble the full stochastic Galerkin stiffness matrix, which drastically decreases memory requirements. However, if a large number of iterations of a Krylov method such as GMRES are required, allocation of the Krylov basis may still require a very large amount of memory. Thus good preconditioning strategies for the stochastic Galerkin system are required, several of which are discussed below.
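The matrix-free product in Eq. 4.9 can be sketched as follows. This is an illustrative toy, assuming the same one-variable orthonormal Hermite coefficients as a stand-in for $c_{ijk}$ and small dense lists for the $K_i$; the function names are hypothetical and the assembled matrix is built only to check the result.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product for small list-of-lists matrices."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def hermite_c(n_basis):
    """c[i][j][k] = E[psi_i psi_j psi_k], i in {0, 1}, orthonormal 1-D Hermite basis."""
    c0 = [[1.0 if j == k else 0.0 for k in range(n_basis)] for j in range(n_basis)]
    c1 = [[0.0] * n_basis for _ in range(n_basis)]
    for k in range(n_basis - 1):
        c1[k][k + 1] = c1[k + 1][k] = math.sqrt(k + 1.0)
    return [c0, c1]

def sg_matvec(c, K, u):
    """v_k = sum_j sum_i c[i][j][k] K[i] u[j], without forming the global matrix."""
    nb, n = len(u), len(u[0])
    # Each product K_i u_j is computed once and reused for every k.
    Ku = [[matvec(Ki, uj) for uj in u] for Ki in K]
    v = [[0.0] * n for _ in range(nb)]
    for k in range(nb):
        for j in range(nb):
            for i in range(len(K)):
                cijk = c[i][j][k]
                if cijk != 0.0:
                    for a in range(n):
                        v[k][a] += cijk * Ku[i][j][a]
    return v

def assemble(c, K):
    """Full stochastic Galerkin matrix, built here only for comparison."""
    nb, n = len(c[0]), len(K[0])
    A = [[0.0] * (nb * n) for _ in range(nb * n)]
    for k in range(nb):
        for j in range(nb):
            for i in range(len(K)):
                for a in range(n):
                    for b in range(n):
                        A[k * n + a][j * n + b] += c[i][j][k] * K[i][a][b]
    return A

K = [[[2.0, -1.0], [-1.0, 2.0]], [[0.2, 0.0], [0.0, 0.1]]]  # illustrative values
c = hermite_c(3)
u = [[1.0, 2.0], [0.5, -1.0], [0.25, 0.0]]

v_free = sg_matvec(c, K, u)
v_full = matvec(assemble(c, K), [x for block in u for x in block])
```

Only the $P + 1$ matrices $K_i$ and the sparse coefficients are stored in the matrix-free form, whereas the assembled matrix has one block per pair $(j, k)$.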

Mean-based preconditioner. The mean-based preconditioner [12] is given by $P = \mathrm{diag}(P_0, \dots, P_0)$, where $P_0 \approx K_0^{-1}$ is a preconditioner for the mean. The mean-based preconditioner is very efficient to compute and apply, since it must be generated only once from a matrix that is the size of the deterministic system. However, it does not incorporate any higher-order stochastic information, and thus its performance degrades as the stochastic dimension, polynomial order, or random field variance increases [16].

Gauss-Seidel preconditioner. One or more outer iterations of the Gauss-Seidel mean algorithm can be used as a preconditioner for the Krylov based iterative methods. An advantage of this method is that the cost of applying the preconditioner can be controlled by adjusting the tolerance of the inner deterministic solves and the number of outer iterations. Decreasing this tolerance and increasing the number of outer iterations will reduce the number of iterations in the Krylov method, but make the preconditioner more expensive to apply, and thus these must be balanced to minimize overall computational cost. Generally we have found the cost of the preconditioner to be dominated by solving the mean systems, and thus the best choice was a very loose inner solver tolerance (0.1) and only one Gauss-Seidel iteration. However, to prevent stagnation of the outer Krylov solver, a flexible variant of the Krylov method (e.g., FGMRES) was necessary.

Approximate Gauss-Seidel preconditioner. The process of increasing the inner solver tolerance can be taken to its extreme by replacing the inner mean solves with an application of the mean preconditioner. As with the Gauss-Seidel preconditioner above, we found experimentally that this approach worked best with only one Gauss-Seidel iteration, and adding additional iterations did not improve the quality of the preconditioner. We also found that the cost of the preconditioner was reduced dramatically if only the first-order terms in the expansion for the stiffness matrix are used in the preconditioner, and that using higher-order terms did not improve performance. We refer to this as the approximate Gauss-Seidel preconditioner.

Approximate Jacobi preconditioner. Similar to the approximate Gauss-Seidel preconditioner, Jacobi iterations can be used with a preconditioner in place of solves with the mean stiffness matrix. In this case we used two outer Jacobi iterations, since the first iteration is equivalent to mean-based preconditioning (i.e., the additional terms on the right-hand side of Eq. 4.7 are zero). Increasing the number of outer iterations did not improve the efficiency of the overall solver. We refer to this as the approximate Jacobi preconditioner.

5. Sparse grid collocation method. In the collocation method, the solution to the PDE is sampled at a pre-selected set of points called collocation points, $\Theta = (\xi^{(1)}, \dots, \xi^{(N)})$. The stochastic solution is constructed by interpolating at these collocation points,

$$u(x, \xi) \approx \sum_{k=1}^{N} u_k(x) L_k(\xi), \tag{5.1}$$

where $\{L_k(\xi)\}$ are Lagrange interpolatory polynomials defined by $L_k(\xi^{(l)}) = \delta_{kl}$, and $u_k$ is the solution of the following deterministic PDE,

$$-\nabla \cdot \left( a(x, \xi^{(k)}) \nabla u_k(x) \right) = f(x, \xi^{(k)}) \quad \text{in } D, \tag{5.2}$$

$$u_k(x) = 0 \quad \text{on } \partial D. \tag{5.3}$$

The collocation points can be chosen as tensor products of 1-D Gaussian quadrature points and the interpolating polynomials as tensor products of 1-D Lagrange interpolating functions. However, the number of collocation points then grows exponentially with the number of random variables. An alternative is to use Smolyak sparse grid quadrature [15, 10], where collocation points that do not increase asymptotic accuracy are removed from the tensor product grid. This results in many fewer collocation points, but still more than the number of stochastic degrees of freedom in the stochastic Galerkin method. This method is fully non-intrusive and is easy to implement with existing legacy software (once the sparse grid is generated) [1].
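The non-intrusive character of collocation can be illustrated with a one-variable sketch. The example below is a toy under stated assumptions: the "deterministic solve" is a scalar stand-in $a(\xi) u = f$ with $a(\xi) = a_0 + a_1 \xi$ (illustrative values), $\xi$ is uniform on $[-1, 1]$, and the collocation points are the three-point Gauss-Legendre nodes. It is not the sparse grid construction used in the experiments.

```python
import math

# Three-point Gauss-Legendre rule on [-1, 1].
NODES = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
WEIGHTS = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

def deterministic_solve(xi, a0=1.0, a1=0.3, f=1.0):
    """Black-box stand-in for the deterministic problem (5.2)-(5.3): a(xi) u = f."""
    return f / (a0 + a1 * xi)

def lagrange(k, xi, nodes):
    """L_k(xi), which equals 1 at nodes[k] and 0 at every other node."""
    value = 1.0
    for l, node in enumerate(nodes):
        if l != k:
            value *= (xi - node) / (nodes[k] - node)
    return value

# One independent deterministic solve per collocation point.
samples = [deterministic_solve(x) for x in NODES]

def surrogate(xi):
    """Interpolant of the form (5.1) built from the sampled solutions."""
    return sum(u_k * lagrange(k, xi, NODES) for k, u_k in enumerate(samples))

# Mean of the solution for xi ~ U[-1, 1] (density 1/2) by Gauss quadrature,
# compared with the closed form available for this toy problem.
mean_colloc = 0.5 * sum(w * u for w, u in zip(WEIGHTS, samples))
mean_exact = math.log(1.3 / 0.7) / 0.6
print(mean_colloc, mean_exact, surrogate(0.5), deterministic_solve(0.5))
```

The solver is called only as a black box at each node, which is why an existing legacy code can be reused unchanged.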

6. Numerical illustration. To compare the performance of the different solvers and preconditioners discussed above, the 2-D stochastic diffusion problem presented in section 2 is solved using both the stochastic Galerkin and stochastic collocation methods from sections 4 and 5. For both solution approaches, the random field is treated both as a uniform random field discretized using a truncated K-L expansion (section 3.1) and as a log-normal random field discretized using a truncated polynomial chaos expansion (section 3.2). In the first case, the orthogonal polynomials used in the stochastic Galerkin method are tensor products of 1-D Legendre polynomials, whereas the collocation points used in the sparse grid stochastic collocation method are built from Gauss-Legendre points; in the second case, tensor products of Hermite polynomials and Gauss-Hermite quadrature points are used. The Dakota package [1] is used to generate the resulting sparse stochastic collocation grids. The spatial dimensions are discretized using a five-point finite-difference stencil on a 32 x 32 grid in the domain $D = [0, 1] \times [0, 1]$, resulting in a total number of spatial degrees of freedom $N_x = 1024$. For simplicity a constant unit force $f(x, \omega) = 1$ is used as the right-hand side in Eq. 2.2. The corresponding stochastic Galerkin linear system is constructed using the Stokhos and Epetra packages in Trilinos. For the Jacobi solver, Gauss-Seidel solver, Gauss-Seidel preconditioner, and stochastic collocation method, the linear systems are solved via multigrid preconditioned GMRES provided by the AztecOO and ML Trilinos packages. For a consistent comparison of all of the preconditioning methods, FGMRES provided by the Belos Trilinos package is used as the outer Krylov solver, with ML providing the preconditioner in the mean-based and approximate Gauss-Seidel and Jacobi preconditioners. GMRES Krylov methods are employed instead of CG for generality and because the numerical implementation of the boundary conditions resulted in nonsymmetric matrices $K_i$.

The solution times for these solvers and preconditioning techniques as a function of the standard deviation of the input random field, stochastic dimension, and polynomial order are tabulated in Tables 6.1-6.6. In the tables, MB, AGS, AJ and GS are the mean-based, approximate Gauss-Seidel, approximate Jacobi, and Gauss-Seidel preconditioners, respectively, for the FGMRES Krylov method. GS 1 and GS 2 are Gauss-Seidel solvers: GS 1 refers to the Gauss-Seidel algorithm in which the matrix-vector products $K_i u_j$ are saved in an array for reuse in later iterations of the "k" loop, whereas GS 2 refers to the variant where these products are recomputed when needed. GS 1 is generally more efficient, but for higher stochastic dimension or polynomial order it requires a large amount of memory. "Jacobi" refers to the Jacobi mean solver, and "collocation" is the solution time using the Smolyak sparse grid collocation method. The solution tolerance for all of the stochastic Galerkin solvers, as well as the solver tolerance for the collocation method, is 1e-12. For the Gauss-Seidel and Jacobi solvers, the inner solver tolerance is 3e-13, and for the Gauss-Seidel preconditioner, the inner solver tolerance is 0.1.

In the tables, "DNC" means did not converge, "Div." means diverged, and "memory" means system memory was exceeded. For the uniform random field with small variance ($\sigma = 0.1$), it can be observed from Tables 6.1 and 6.2 that the more intrusive Krylov-based stochastic Galerkin solvers are more efficient than the less intrusive Gauss-Seidel and Jacobi solvers, which are in turn generally more efficient than the non-intrusive stochastic collocation method. Moreover, the approximate Gauss-Seidel and Jacobi preconditioners are a significant improvement over the traditional mean-based approach. However, as the variance of the random field is increased, we see from Table 6.3 that the Gauss-Seidel and Jacobi solvers suffer considerably, whereas the Krylov-based approaches (excluding the Gauss-Seidel preconditioner) still perform quite well. This is not unexpected, as the operator becomes more indefinite as the variance increases. For the log-normal random field, however, we see from Tables 6.4 and 6.5 that the Krylov-based stochastic Galerkin approach is more efficient than the collocation approach for larger stochastic dimension or polynomial order only when using the approximate Gauss-Seidel or approximate Jacobi preconditioners. It is also interesting to see that the Gauss-Seidel solver GS 1 is faster than GMRES with mean-based preconditioning in this case. For higher variance of the random field, we see from Table 6.6 that the Krylov iterative method with the approximate Jacobi preconditioner failed to converge and the Jacobi solver diverged. This problem can be rectified by using the true diagonal block $K_{kk} = \sum_{i=0}^{P} c_{ikk} K_i$ of the global stochastic stiffness matrix as the left-hand side in the Jacobi solver and preconditioner instead of the mean matrix $K_0$.

Figures 6.1(a) and 6.1(b) show plots of the relative residual error vs iteration count for the stochastic Galerkin system with stochastic dimension 5 and polynomial order 5. It can be observed that the Gauss-Seidel solver takes the fewest iterations, whereas the Jacobi solver takes the most iterations for a given tolerance. However, in terms of solution time, the matrix-free Krylov solver with the approximate Gauss-Seidel or Jacobi preconditioner is the most efficient.

Table 6.1
Solution time (sec) vs stochastic dimension for uniform random field, PC order = 5 and σ = 0.1

Stoch.   Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
dim      MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
2        0.20     0.12     0.18     0.25    1.23     1.21     2.22      0.52
3        0.70     0.39     0.54     0.77    3.84     3.87     7.46      3.18
4        1.78     1.01     1.38     2.02    9.56     9.73     18.60     10.24
5        4.34     2.31     3.05     4.59    20.40    20.90    41.10     26.96
6        10.24    5.41     7.10     9.81    46.10    46.30    87.50     64.09
7        19.50    10.24    12.96    19.64   81.20    80.40    160.00    134.45


Table 6.2
Solution time (sec) vs order of polynomial chaos when diffusion coefficient is uniform random field, Stoch. dim = 3, σ = 0.1

PC       Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
order    MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
2        0.10     0.05     0.07     0.11    0.47     0.45     0.83      0.11
3        0.19     0.12     0.16     0.24    1.10     1.08     2.09      0.42
4        0.39     0.23     0.32     0.47    2.20     2.18     4.15      1.27
5        0.69     0.39     0.54     0.77    3.80     3.88     7.47      3.21
6        1.08     0.90     0.91     1.54    6.22     6.61     12.3      7.06
7        1.63     1.05     1.27     1.87    9.16     9.33     17.90     14.20
8        2.51     2.65     2.14     3.53    13.10    14.20    26.10     26.08
9        3.81     3.43     2.94     5.16    18.20    20.40    36.40     46.72
10       5.22     4.44     3.92     6.73    24.40    27.30    47.30     78.87

Table 6.3
Solution time (sec) vs standard deviation (σ) when diffusion coefficient is uniform random field, Stoch. dim = 3 and PC order = 5

         Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
σ        MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
0.10     0.66     0.40     0.59     0.77    3.80     3.88     7.37      3.18
0.11     0.75     0.46     0.64     0.87    4.73     4.79     9.15      3.32
0.12     0.88     0.50     0.72     0.98    5.83     5.94     11.50     3.46
0.13     1.09     0.58     0.80     1.17    7.67     7.83     14.90     3.63
0.14     1.38     0.69     1.00     1.48    10.80    11.00    21.20     3.85
0.15     1.91     0.97     1.29     1.89    15.60    18.00    34.70     4.16

Table 6.4
Solution time (sec) vs stochastic dimension when diffusion coefficient is log-normal random field, PC order = 5 and σ = 0.1

Stoch.   Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
dim      MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
2        0.26     0.23     0.23     0.41    0.59     0.78     1.61      0.57
3        1.40     1.15     1.06     2.13    2.16     3.57     5.26      3.25
4        7.10     5.45     4.79     10.17   7.14     14.30    16.80     10.91
5        28.08    20.60    19.46    38.68   22.30    47.90    49.70     27.60
6        93.20    66.05    59.84    132.30  memory   139.00   137.00    63.98

Table 6.5
Solution time (sec) vs order of polynomial chaos when diffusion coefficient is log-normal random field, Stoch. dim = 3, σ = 0.1

PC       Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
order    MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
2        0.06     0.05     0.07     0.11    0.21     0.21     0.45      0.11
3        0.16     0.13     0.15     0.25    0.51     0.50     1.11      0.42
4        0.47     0.40     0.37     0.77    1.04     1.23     2.47      1.29
5        1.39     1.15     1.06     2.12    2.15     3.53     5.23      3.28
6        4.01     3.05     2.74     5.50    4.39     10.60    11.50     7.39
7        8.69     6.23     5.50     11.18   8.03     22.30    23.50     15.02
8        23.79    15.40    13.88    30.47   17.30    63.50    54.10     28.40
9        54.36    35.40    34.80    71.38   36.40    161.00   126.00    50.57
10       106.40   66.95    66.80    125.96  64.70    295.00   253.00    87.27

Table 6.6
Solution time (sec) vs standard deviation (σ) when diffusion coefficient is log-normal random field, Stoch. dim = 3 and PC order = 5

         Preconditioners for GMRES          GS and Jacobi solvers       Sparse grid
σ        MB       AGS      AJ       GS      GS 1     GS 2     Jacobi    collocation
0.10     1.51     1.23     1.14     2.08    2.11     3.56     5.21      3.25
0.15     1.86     1.42     1.25     2.49    2.86     4.87     9.15      3.64
0.20     2.20     1.72     2.06     3.06    3.63     6.17     46.80     4.11
0.25     2.66     2.03     DNC      3.58    4.87     8.40     Div.      4.59
0.30     3.11     2.35     DNC      4.35    6.90     11.90    Div.      5.13
0.35     3.65     2.87     DNC      5.17    11.20    19.30    Div.      5.72
0.40     4.44     3.39     DNC      6.16    24.30    41.90    Div.      6.53

Fig. 6.1. Relative error norm vs iteration count for various solvers and preconditioners, where M = 5 and p = 5: (a) uniform random field, σ = 0.12; (b) log-normal random field, σ = 0.2.

7. Conclusions. In this work, various preconditioners for Krylov-based methods and solver methods based on the Gauss-Seidel and Jacobi methods are introduced. Results are compared with GMRES with mean-based preconditioning and with collocation. In the case of linear dependence on the random variables, we generally find the more intrusive Krylov-based approaches to be more efficient than the non-intrusive collocation approach, with the less intrusive Gauss-Seidel/Jacobi approaches in between. This demonstrates a trade-off in performance versus intrusiveness when existing legacy simulation codes must be used. In the case of nonlinear dependence on the random variables, however, extrapolating beyond these tables suggests that the non-intrusive collocation approach is in fact the most efficient. This generally suggests that for linear problems (in the random variables and the solution), intrusive stochastic Galerkin approaches would be preferred, but for nonlinear problems (in either the random variables or the solution), non-intrusive approaches would be preferred. Regardless, the use of approximate Gauss-Seidel or Jacobi preconditioners is a significant improvement over traditional mean-based preconditioning. In the future, we want to compare the Kronecker product preconditioner proposed in [16] with the above methods. We would also like to investigate block and recycled Krylov methods to improve the efficiency of the Jacobi and Gauss-Seidel solvers.

UNCERTAINTY QUANTIFICATION OF THE SEMICONDUCTOR
DRIFT-DIFFUSION EQUATIONS

CHRISTOPHER  W.  MILLER*,  RAYMOND  S.  TUMINARO* *,  ERIC  T.  PHIPPS*,  AND  HOWARD  C.  ELMAN§ 

Abstract.  The  movement  of  charge  carriers  in  a  semiconductor  device  is  modeled  by  a  set  of  coupled  non-linear 
partial  differential  equations  known  as  the  drift-diffusion  equations.  The  physical  parameters  involved  in  defining 
these  equations  are  subject  to  large  amounts  of  uncertainty.  The  aim  of  this  paper  is  to  examine  the  applicability 
of  uncertainty  quantification  techniques  to  this  problem.  We  express  the  uncertainty  regarding  these  parameters  by 
modeling  them  as  random  variables  and  apply  anisotropic  sparse  grid  collocation  to  the  resulting  set  of  stochastic 
partial  differential  equations.  We  identify  the  most  sensitive  parameters  using  local  sensitivity  analysis,  and  use  this 
information  to  formulate  a  reduced  problem.  We  apply  Sobol’  sensitivity  analysis  to  the  solution  of  the  reduced 
model  and  analyze  the  probability  distribution  of  the  model  outputs  at  various  time  steps.  This  preliminary  work 
reveals  new  approaches  for  quantifying  uncertainty  in  semiconductor  devices. 


1. Introduction. Radiation interacts with semiconductor devices by knocking atoms from the device's silicon lattice. These defect species, which can consist of silicon or the P-type or N-type dopants, can carry charge and propagate through the device. Examining the behavior of semiconductor devices in radioactive environments is complicated by the cost and lack of availability of experimental facilities. To alleviate this, Sandia National Laboratories has invested in the use of computational modeling to examine the performance of semiconductor devices in radioactive environments. In particular, the finite element code Charon was developed to discretize and solve the semiconductor drift-diffusion equations that model the movement of charge carriers inside semiconductor devices [5], [15].

A difficulty with this model is that the parameters that describe the interactions of the defect species are often known only to a limited accuracy. Confidence in the model solutions is then limited by lack of confidence in the accuracy of the parameter values. This lack of knowledge represents an epistemic uncertainty which can be quantified by modeling the parameters as random variables. As a consequence of the Doob-Dynkin lemma, the solution of the drift-diffusion equations can be represented as a random process of the uncertain parameters [4]. Several methods have been developed to approximate the statistics associated with such a random solution process, including the Monte Carlo method [9] and, more recently, the stochastic collocation methods [2], [10], [11], [16]. In this paper we apply the anisotropic sparse grid collocation method because of its favorable convergence properties and minimal dependence on the size of the parameter space [11]. The solution process arising from the collocation method can be post-processed to compute statistical quantities associated with the solution process.

This work is viewed as a continuation of [12], which performed a transient sensitivity analysis of the solution of the deterministic drift-diffusion equations with respect to the defect reaction parameters. Here we seek to identify additional methods for quantifying the uncertainty inherent in this problem. The structure of this paper is as follows. Section 2 describes the deterministic formulation of the semiconductor drift-diffusion equations and our extension of this problem from a deterministic setting to a non-deterministic one. Section 3 describes the sparse grid stochastic collocation method used to perform the uncertainty quantification calculations. Section 4 describes how the solution derived using the collocation method can be examined to explore interactions among parameters. Section 5 describes the application of these methods to a semiconductor device under the influence of a radiation pulse. Finally, in Section 6 we draw some conclusions and propose additional applications.

Fig. 2.1. Scanning electron microscope image of an NPN BJT (a) and diagram of the emitter, base, and collector regions (b) [12].

2. Non-deterministic Problem Formulation. The device considered in this paper is an NPN bipolar junction transistor (BJT) subjected to a radiation pulse. The device is pictured in Figure 2.1. In a BJT the silicon lattice has been modified by the introduction of dopants to produce an excess of free electrons in the N-regions (N-doping), and to produce an excess of holes (positive charge carriers) in the P-region (P-doping). The P-dopant in the device considered here is boron, while the N-dopant is phosphorus. Radiation interacts with the device by knocking an atom free from the lattice. This creates a free interstitial atom and a vacancy, together referred to as a Frenkel pair. Both the free atom and the vacancy can carry charge, move through the device, and interact with other defect species. The diffusion, transport, and generation of charge carriers are governed by the following set of coupled partial differential equations

$$\begin{aligned}
-\nabla \cdot (\lambda^2 \nabla \psi) &= p - n + C + \sum_{i=1}^{N} Z_i Y_i \\
\nabla \cdot (-\mu_n n \nabla \psi + D_n \nabla n) &= \frac{\partial n}{\partial t} + R_n \\
\nabla \cdot (\mu_p p \nabla \psi + D_p \nabla p) &= \frac{\partial p}{\partial t} + R_p \\
\nabla \cdot (\mu_{Y_i} Y_i \nabla \psi + D_{Y_i} \nabla Y_i) &= \frac{\partial Y_i}{\partial t} + R_{Y_i}, \quad i = 1, \dots, N
\end{aligned} \tag{2.1}$$


where $\psi$ is the scalar electric potential, and $n$, $p$, and $Y_i$ are the electron, hole, and $i$th defect species densities, respectively. For $x \in \{n, p, Y_i\}$, $\mu_x$ and $D_x$ are the mobility and diffusivity coefficients for species $x$, $Z_i$ is the integer charge of the $i$th defect species, $\lambda$ is the Debye length of the device and $C$ is the doping profile. The generation and recombination of species $x$ is given by the right-hand-side term $R_x$ [15]. The reactions we are concerned with are the so-called carrier-defect reactions: reactions between a defect and a hole or electron. These reactions have the form

$$\begin{aligned}
X^{m} \to X^{m+1} + e^{-}, \qquad & X^{m} \to X^{m-1} + h^{+} && \text{(Generation)} \\
X^{m+1} + e^{-} \to X^{m}, \qquad & X^{m-1} + h^{+} \to X^{m} && \text{(Recombination).}
\end{aligned} \tag{2.2}$$


The forcing term associated with these reactions is modeled by

$$R_{X^{m+1}} = \sigma A X^{m} \exp\left( -\frac{\Delta E}{kT} \right). \tag{2.3}$$

Here $X$ denotes the concentration of a certain defect species, with superscripts denoting the integer charge of the defect. $A$ is a constant, $\sigma$ is the reaction cross section, $\Delta E$ is the activation energy, $k$ is Boltzmann's constant and $T$ is the lattice temperature. The parameters that we investigate are the reaction cross sections and the activation energies. A subscript on $\sigma$ and $\Delta E$ is omitted to improve readability; however, each reaction has a different value for each of these parameters. The activation energy for a recombination reaction is known to be equal to zero [12]. For our problem there are 84 carrier-defect reactions involving 35 defect species and a total of 127 reaction parameters. Table 2.1 shows a small sample of the carrier-defect reactions along with an estimate of one of the associated parameters.
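As a small numerical illustration of the forcing term (2.3), the sketch below evaluates the Arrhenius factor for a generation and a recombination reaction. It assumes the form $\sigma A X^m \exp(-\Delta E / kT)$, an activation energy expressed in eV, and a temperature in kelvin; the function name, the values of $A$, $X$ and $T$, and the pairing of the sample parameter values are illustrative assumptions only.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def reaction_rate(sigma, A, X, delta_E=0.0, T=300.0):
    """Forcing term of the assumed form sigma * A * X * exp(-delta_E / (k T)).

    delta_E is taken to be in eV and T in kelvin (an assumption of this sketch).
    """
    return sigma * A * X * math.exp(-delta_E / (BOLTZMANN_EV_PER_K * T))

# Recombination: the activation energy is zero, so only the cross section matters.
recombination = reaction_rate(sigma=3.0e-16, A=1.0, X=1.0)

# Generation with a nonzero activation energy is damped by the exponential factor.
generation = reaction_rate(sigma=3.0e-16, A=1.0, X=1.0, delta_E=0.09)

print(recombination, generation, generation / recombination)
```

Because the activation energy enters through an exponential, a given relative uncertainty in $\Delta E$ can change the rate far more than the same relative uncertainty in $\sigma$, which enters linearly.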


#     Reaction                         Parameter     Approximate value
13    $e^- + V^- \to V^{--}$           $\sigma$      $3.0 \times 10^{-16}$
14    $V^{--} \to e^- + V^-$           $\Delta E$    0.09
40    $e^- + BV^+ \to BV^0$            $\sigma$      $3.0 \times 10^{-14}$

Table 2.1
A sample of the 127 carrier-defect reactions [12]


Discretization and numerical solution of the partial differential equations in (2.1) are accomplished using Sandia's Charon software. Chemical kinetics computations are accomplished using CHEMKIN [6]. Charon uses a Galerkin finite element discretization consisting of two-dimensional piecewise bilinear finite element functions defined on a mesh of quadrilaterals with streamline upwind Petrov-Galerkin stabilization. Further details of the discretization procedure can be found in [5]. In this study we perform the calculations on the pseudo one-dimensional domain shown as the vertical white strip under the emitter in Figure 2.1. Two-dimensional effects do not arise for the device operating under the conditions considered in this paper.

As stated previously, the reaction parameters $\sigma$ and $\Delta E$ appearing in (2.2) are subject to a large degree of uncertainty. Our aim here is to develop quantitative insight into the effects of these reaction parameters, denoted generically here as $\{\xi_i\}$. To accomplish this, we model each of the 127 reaction parameters as a uniform random variable centered at $\mu_i$, the estimated value from previous deterministic studies [12]. The probability distribution of each random variable is then given by

Pi(£i)  ~  i  2jj.  ^  ^i-ApiPi+.Cpi]  •  (2.4) 

Although  these  intervals  are  large,  they  still  may  underestimate  the  uncertainty  associated 
with  the  parameters.  We  assume  that  the  uncertainties  with  respect  to  each  parameter  are 
independent  and  so  we  can  define  a  joint  probability  distribution  on  the  parameter  space  by 

127  127  i 

P(0  =  n^)  =  n  (2.5) 

i=l  i=l  ^ 

The problem is now to find random processes

$$\psi(x, t, \xi), \quad n(x, t, \xi), \quad p(x, t, \xi), \quad Y_k(x, t, \xi) \tag{2.6}$$

that satisfy (2.1) almost surely. It should be noted that additional uncertainties are associated with the doping profile and the diffusivity and mobility coefficients. These are not considered in this paper.

The assumption of independence deserves some scrutiny, since the parameters associated with all of the reactions involving a given defect species are almost certainly correlated. However, it is shown in [2] that replacing the true joint density function with the product of the marginal density functions only affects the convergence of collocation methods up to a constant. This constant may be large, however, and further work is required to better approximate the joint PDF appearing in (2.5).

3. The Stochastic Collocation Method. A methodology for computing the random processes in (2.6) is the stochastic collocation method. The sparse grid collocation method was first described in [16], and error analysis was performed in [2] and [10]. These methods are all suited to problems whose dependence on the random parameters is isotropic. We choose to apply an anisotropic version of the sparse grid collocation method developed in [11]. Other approaches for addressing anisotropic problems can be found in [7] and [8]. Here we only present the derivation of the isotropic method described in [16].

In order to derive the stochastic collocation method, one begins by considering interpolation operators for one-dimensional functions defined on a finite interval. Without loss of generality, we can assume that the interval is $[-1, 1]$. Let $f : [-1, 1] \to \mathbb{R}$ and define the interpolation operator

$$\mathcal{U}^{m} f(\xi) = \sum_{k=1}^{m} f(\xi^{(k)}) \, \ell_k(\xi), \tag{3.1}$$

where $\{\xi^{(k)}\}_{k=1}^{m} = \Theta_m$ is a set of $m$ distinct points and where $\ell_k$ is the Lagrange interpolating polynomial of degree $m - 1$ defined by

$$\ell_k(\xi^{(j)}) = \delta_{kj}. \tag{3.2}$$

Evaluation of the interpolant requires the evaluation of the function $f$ at the points contained in $\Theta_m$. By construction we have that $f(\xi^{(k)}) = \mathcal{U}^{m} f(\xi^{(k)})$ for all $\xi^{(k)}$ in $\Theta_m$.

Now we consider interpolation in multiple dimensions. Let $f : [-1, 1]^M \to \mathbb{R}$. In order
to generalize the one-dimensional interpolation operators to multiple dimensions, an obvious
approach would be to take tensor products of one-dimensional interpolation operators $\mathcal{U}^{m_{i_k}}$
along each coordinate axis. Define the tensor product interpolant by

$$\mathcal{A}^{\mathrm{Tensor}}_{\mathbf{i}} f(\xi) = \left( \mathcal{U}^{m_{i_1}} \otimes \mathcal{U}^{m_{i_2}} \otimes \cdots \otimes \mathcal{U}^{m_{i_M}} \right) f(\xi). \qquad (3.3)$$

The multi-index $\mathbf{i} \in \mathbb{N}_+^M$ describes how many interpolation points are used along each axis.
The evaluation of $\mathcal{A}^{\mathrm{Tensor}}_{\mathbf{i}}$ requires the evaluation of the function f on the grid

$$\Theta_{\mathbf{i}} = \Theta_{m_{i_1}} \times \cdots \times \Theta_{m_{i_M}}, \qquad (3.4)$$

with the cardinality of this grid given by

$$|\Theta_{\mathbf{i}}| = \prod_{k=1}^{M} m_{i_k}. \qquad (3.5)$$

The relation (3.5) is referred to as the curse of dimensionality [16] since the cardinality of
the grid grows exponentially in the dimension M. Thus for problems involving a moderate
or large number of parameters, use of the tensor product formula (3.3) is computationally
infeasible.
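
The growth described by (3.5) is easy to tabulate. The sketch below assumes, purely for illustration, five interpolation points along every axis; the function name is hypothetical.

```python
import math

def tensor_grid_size(points_per_axis):
    """Cardinality of the tensor grid in (3.5): the product of the per-axis point counts."""
    return math.prod(points_per_axis)

# Illustrative choice: 5 points per axis in M = 2, 5 and 15 dimensions.
for M in (2, 5, 15):
    print(M, tensor_grid_size([5] * M))
```

With five points per axis the count is 5^M, so a 15-parameter problem would already need more than 3 x 10^10 function evaluations.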

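Smolyak sparse grids provide a method of approximating multi-dimensional functions
that avoids the curse of dimensionality. Sparse grid interpolation was introduced in [13]. The
sparse grid interpolant is formed by taking a selective sum of tensor product rules appearing
in (3.3). Define the index set

$$Y_{q,M} = \left\{ \mathbf{i} \in \mathbb{N}^M,\ \mathbf{i} \ge 1 : q - M + 1 \le \sum_{k=1}^{M} (i_k - 1) \le q \right\}. \qquad (3.6)$$

Then the sparse grid interpolation operator is given by

$$\mathcal{A}_{q,M} f = \sum_{\mathbf{i} \in Y_{q,M}} (-1)^{q+M-|\mathbf{i}|} \binom{M-1}{q+M-|\mathbf{i}|} \cdot \left( \mathcal{U}^{m_{i_1}} \otimes \cdots \otimes \mathcal{U}^{m_{i_M}} \right) f, \qquad (3.7)$$

where $|\mathbf{i}| = i_1 + \cdots + i_M$. Evaluation of this interpolation operator requires the evaluation of the
function f on the sparse grid

$$\mathcal{H}_{q,M} = \bigcup_{\mathbf{i} \in Y_{q,M}} \left( \Theta_{m_{i_1}} \times \cdots \times \Theta_{m_{i_M}} \right). \qquad (3.8)$$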


In order to fully define the sparse grid interpolation operator it is necessary to specify the
points used in constructing the one-dimensional interpolation operators. It is advantageous if
the grids have the property that $\mathcal{H}_{q,M} \subset \mathcal{H}_{q+1,M}$. One way to accomplish this is to construct
the one-dimensional operators using the Clenshaw-Curtis abscissas [16]. Let

$$\xi^{(j)} = -\cos\left( \frac{\pi (j - 1)}{m_i - 1} \right), \qquad j = 1, \ldots, m_i, \qquad (3.9)$$

$$m_i = \begin{cases} 1 & \text{if } i = 1, \\ 2^{i-1} + 1 & \text{if } i > 1. \end{cases} \qquad (3.10)$$


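With this choice of points we obtain $\Theta_{m_i} \subset \Theta_{m_{i+1}}$ and hence $\mathcal{H}_{q,M} \subset \mathcal{H}_{q+1,M}$. In this case (3.7)
and (3.8) simplify to

$$\mathcal{A}_{q,M} f = \sum_{\mathbf{i} \in Y_{q,M}} (-1)^{q+M-|\mathbf{i}|} \binom{M-1}{q+M-|\mathbf{i}|} \cdot \left( \mathcal{U}^{m_{i_1}} \otimes \cdots \otimes \mathcal{U}^{m_{i_M}} \right) f \qquad (3.11)$$

and

$$\mathcal{H}_{q,M} = \bigcup_{\mathbf{i} \in X_{q,M}} \left( \Theta_{m_{i_1}} \times \cdots \times \Theta_{m_{i_M}} \right), \qquad (3.12)$$

where

$$X_{q,M} = \left\{ \mathbf{i} \in \mathbb{N}_+^M : \sum_{k=1}^{M} (i_k - 1) = q \right\}.$$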
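
To make the nested construction concrete, the following sketch generates the Clenshaw-Curtis points from (3.9) and (3.10) and assembles the sparse grid as the union (3.12). Two details are assumptions of the sketch rather than statements from the text: the single point of the level-1 rule is placed at 0 (formula (3.9) is undefined for a single point), and coordinates are rounded to 12 decimal places so that points shared between levels compare equal.

```python
import itertools
import math

def num_points(i):
    """Growth rule (3.10): one point at level 1 and 2**(i-1) + 1 points for i > 1."""
    return 1 if i == 1 else 2 ** (i - 1) + 1

def cc_points(i):
    """Clenshaw-Curtis abscissas (3.9) at level i, rounded so nested points coincide."""
    m = num_points(i)
    if m == 1:
        return [0.0]  # assumed convention for the single-point rule
    return [round(-math.cos(math.pi * (j - 1) / (m - 1)), 12) for j in range(1, m + 1)]

def sparse_grid(q, M):
    """Union (3.12) of tensor grids over multi-indices i with sum(i_k - 1) == q."""
    grid = set()
    for i in itertools.product(range(1, q + 2), repeat=M):
        if sum(ik - 1 for ik in i) == q:
            grid.update(itertools.product(*(cc_points(ik) for ik in i)))
    return grid

# Sparse grid sizes in M = 3 dimensions for levels q = 2 and q = 3.
print(len(sparse_grid(2, 3)), len(sparse_grid(3, 3)))
```

For M = 3 this yields 25 and 69 points at levels q = 2 and q = 3, compared with 9^3 = 729 points for the full tensor grid built from the 9-point rule.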

It is known that if f is an M-variate polynomial of total degree q - M + 1 then $\mathcal{A}_{q,M} f = f$ [3].
Thus one can expect that if f is sufficiently regular then the approximation $\mathcal{A}_{q,M} f$ converges
quickly in the sparse grid level q. This statement is made precise in [10].

The solution obtained through collocation can be post-processed to compute various
other quantities of interest [16]. Moments of f can be approximated by

$$\mathbb{E}(f^m) \approx \mathbb{E}(\mathcal{A}_{q,M} f^m), \qquad (3.13)$$

where the quantity on the right of (3.13) can be computed exactly and efficiently using
Clenshaw-Curtis quadrature. One may also want to compute probability distributions associated
with the solution. These can be approximated by sampling the collocation solution
at a random set of points and then measuring the relative frequency of the desired event. Assume
that we sample N times. Let $\Lambda = \{\xi^{(k)}\}_{k=1}^{N}$ be the set of sample points and let
$\Lambda_{\mathcal{A}_{q,M} f < c}$ be the subset of points in $\Lambda$ such that $\mathcal{A}_{q,M} f(\xi^{(k)}) < c$; then

$$P(f < c) \approx \frac{|\Lambda_{\mathcal{A}_{q,M} f < c}|}{N}. \qquad (3.14)$$

Fig. 3.1. Clenshaw-Curtis based sparse grids $\mathcal{H}(2, 3)$ and $\mathcal{H}(3, 3)$


The advantage of this approach to computing the CDF instead of direct sampling of the function
f is that evaluation of $\mathcal{A}_{q,M} f$ can often require much less computational effort than
directly evaluating f. Our routines for performing the sparse grid collocation method and
post-processing the collocation solution are provided by Sandia’s Dakota software [1].
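
The relative-frequency estimate in (3.14) can be sketched in a few lines. This is an illustration only: the `surrogate` function is a hypothetical stand-in for the collocation interpolant (it is not the Dakota implementation), and the parameters are assumed to be uniformly distributed on [-1, 1]^M.

```python
import random

def surrogate(xi):
    """Hypothetical cheap stand-in for the sparse grid interpolant of f."""
    return xi[0] + 0.5 * xi[1] ** 2

def empirical_cdf(model, c, dim, n_samples, seed=0):
    """Fraction of uniform samples on [-1, 1]^dim with model(xi) < c, as in (3.14)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        xi = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if model(xi) < c:
            hits += 1
    return hits / n_samples

# Estimate the CDF of the surrogate at a few thresholds.
for c in (-0.5, 0.0, 0.5, 1.0):
    print(c, empirical_cdf(surrogate, c, dim=2, n_samples=10000))
```

Because every sample only evaluates the surrogate, N can be made large at little cost; the accuracy of the estimate is still limited by how well the interpolant approximates f.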

4. Sobol’ Sensitivity Analysis. Given a function $f : [-1, 1]^M \to \mathbb{R}$ of M parameters,
one may be interested in how perturbations of each parameter contribute to changes in the
function value. In many cases, however, the function value is not sensitive with respect to
perturbations of a single parameter but rather is sensitive with respect to simultaneous changes
in multiple parameters. Sobol’ sensitivity analysis is one method for describing the sensitivity
of the function with respect to coupled subsets of parameters. The method proceeds by
performing a standard ANOVA decomposition of the function f and then computing the Sobol’
indices as a ratio of the partial variance to the total variance [14].

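The ANOVA decomposition of f is as follows. Assuming that f is square integrable,
decompose f into the sum

$$f(\xi) = f_0 + \sum_{s=1}^{M} \sum_{i_1 < \cdots < i_s} f_{i_1 \ldots i_s}(\xi_{i_1}, \ldots, \xi_{i_s}). \qquad (4.1)$$

Then define the partial and total variances as

$$D_{i_1 \ldots i_s} = \int_{[-1,1]^s} f_{i_1 \ldots i_s}^2 \, d\xi_{i_1} \cdots d\xi_{i_s} \quad \text{and} \quad D = \int_{[-1,1]^M} f^2 \, d\xi - f_0^2. \qquad (4.2)$$

Finally define the Sobol’ index and total Sobol’ index as

$$S_{i_1 \ldots i_s} = \frac{D_{i_1 \ldots i_s}}{D} \quad \text{and} \quad TS_i = \sum_{s=1}^{M} \ \sum_{\substack{i_1 < \cdots < i_s \\ i \in \{i_1, \ldots, i_s\}}} S_{i_1 \ldots i_s}. \qquad (4.3)$$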


The Sobol’ index $S_{i_1 \ldots i_s}$ measures the dependence of the function on each subset of the M
parameters while the total Sobol’ index $TS_i$ measures the dependence of the function f on
the ith parameter. There is a total of $2^M$ terms appearing in (4.1), $2^M - 1$ Sobol’ indices, and
M total Sobol’ indices. The Sobol’ indices can be used to measure the strength of coupling
effects between specific subsets of parameters on the function f. The Sobol’ indices $S_i$ for
$1 \le i \le M$ are referred to as the main effect of parameter i. The total Sobol’ index can be used
as a global sensitivity measure of f with respect to a given parameter. We define the ratio

$$r_i = \frac{TS_i - S_i}{TS_i} \qquad (4.4)$$

as the relative difference between the ith total Sobol’ index and the main effect of parameter i for
$1 \le i \le M$. By construction, $r_i$ satisfies $0 \le r_i \le 1$. If $r_i$ is close to 1 then most of the sensitivity with
respect to parameter i is tied up in coupling effects. If $r_i$ is close to 0 then the ith parameter is
only weakly coupled to the rest of the parameters. This information can be used to examine
the strength of parameter coupling effects on the model solution.
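
A small worked example may help fix the definitions. The sketch below computes the Sobol’ indices, the total indices, and the ratios (4.4) for a hypothetical two-parameter function. It assumes the uniform probability measure on [-1, 1]^2 (so all integrals are averages) and uses a simple midpoint rule instead of the collocation-based post-processing used in this work.

```python
def sobol_two_params(f, n=200):
    """Sobol' indices of f on [-1, 1]^2 via an n-point midpoint rule per axis.

    Assumes the uniform probability measure, so integrals become averages.
    """
    nodes = [-1.0 + (2 * j + 1) / n for j in range(n)]
    vals = [[f(x1, x2) for x2 in nodes] for x1 in nodes]
    f0 = sum(map(sum, vals)) / n**2
    total_var = sum(v * v for row in vals for v in row) / n**2 - f0**2
    # First-order ANOVA terms: conditional means minus the overall mean f0.
    f1 = [sum(row) / n - f0 for row in vals]
    f2 = [sum(vals[a][b] for a in range(n)) / n - f0 for b in range(n)]
    d1 = sum(v * v for v in f1) / n
    d2 = sum(v * v for v in f2) / n
    d12 = total_var - d1 - d2  # the remaining variance is the interaction term
    s1, s2, s12 = d1 / total_var, d2 / total_var, d12 / total_var
    ts1, ts2 = s1 + s12, s2 + s12
    return {"S": (s1, s2, s12), "TS": (ts1, ts2),
            "r": ((ts1 - s1) / ts1, (ts2 - s2) / ts2)}

# Hypothetical test function with one interaction term.
result = sobol_two_params(lambda x1, x2: x1 + 0.5 * x2 + x1 * x2)
print(result)
```

For this function the exact values under the stated assumptions are S_1 = 12/19, S_2 = 3/19 and S_12 = 4/19, giving r_1 = 1/4 and r_2 = 4/7: the second parameter acts mostly through its coupling with the first, even though its main effect is small.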

5. Application to the semiconductor drift diffusion equations. In this section we apply
the above techniques to the analysis of a radiation-damaged BJT. A radiation pulse is
simulated by adding a transient source of electrons and holes [12]. The shape of this pulse
is shown in Figure 5.1. The radiation pulse begins at time $t = 1 \times 10^{-5}$ and ends at time
$t = 2 \times 10^{-4}$.


Fig.  5.1.  Radiation  pulse  [12] 


Our goal is to investigate the current $I(t, \xi)$ at the base contact as a function of time and
the uncertain reaction parameters. In principle one could use the model for the reaction
parameters described in (2.5) to perform a collocation study that would generate an approximate
response surface over the entire parameter space. However, it is computationally intractable
to do this on the 127-parameter space that includes all of the reaction parameters, so a
form of model reduction is necessary first.

In order to reduce the number of parameters to be included in the collocation study we
first execute a “one-at-a-time” (OAT) study to estimate the sensitivity of the solution with
respect to each individual parameter. For this, we fix 126 of the 127 uncertain parameters
at their mean values and perform a one-dimensional collocation to approximate the function
$I(\mu_1, \ldots, \xi_i, \ldots, \mu_M, t)$ as a function of time and a single reaction parameter $\xi_i$. This approach
scales very well since it only requires the solution of a series of one-dimensional problems.
From the collocation approximation of this function we can compute estimates of the sensitivity
of I with respect to a single parameter. We use two metrics to measure the sensitivity:
$\sigma_i$, the standard deviation of the current with respect to the ith parameter, and $\left. \frac{\partial I}{\partial \xi_i} \right|_{\xi = \mu}$, the
scaled sensitivity of parameter i evaluated at the mean parameter value. The evaluation of the
scaled sensitivity is also investigated in [12]. These two measures are plotted for each of the
127 parameters at a variety of times in Figure 5.2. The plots in Figure 5.2 are normalized
so that $\sum_{i=1}^{127} \sigma_i = 1$ and $\sum_{i=1}^{127} \left| \frac{\partial I}{\partial \xi_i} \right|_{\xi = \mu} = 1$. Thus each bar can be interpreted as the relative
importance of parameter i with respect to each sensitivity metric.
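
The OAT ranking can be sketched as follows. This is an illustration under stated assumptions: `toy_model` is a hypothetical five-parameter stand-in for the device current, each parameter is assumed to vary uniformly within one unit of its mean, and the one-dimensional collocation step is replaced by plain equispaced sampling.

```python
import math

def oat_std(model, mean, i, half_width=1.0, n=101):
    """Standard deviation of the model when only parameter i varies about its mean."""
    values = []
    for j in range(n):
        xi = list(mean)
        xi[i] = mean[i] + half_width * (-1.0 + 2.0 * j / (n - 1))
        values.append(model(xi))
    avg = sum(values) / n
    return math.sqrt(sum((v - avg) ** 2 for v in values) / n)

def rank_parameters(model, mean, keep):
    """Normalize the OAT standard deviations to sum to one and keep the largest ones."""
    sigma = [oat_std(model, mean, i) for i in range(len(mean))]
    total = sum(sigma)
    weights = [s / total for s in sigma]
    order = sorted(range(len(mean)), key=lambda i: weights[i], reverse=True)
    return weights, order[:keep]

def toy_model(xi):
    """Hypothetical model: parameters 0 and 3 dominate, parameter 4 has no effect."""
    return 3.0 * xi[0] + 0.1 * xi[1] + 0.01 * xi[2] ** 2 + 2.0 * math.sin(xi[3])

weights, top = rank_parameters(toy_model, [0.0] * 5, keep=2)
print(weights, top)
```

The normalized weights play the role of the bars in Figure 5.2, and the retained indices correspond to the reduced parameter set carried into the multi-dimensional study.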

We make two observations. First, the standard deviation and the scaled sensitivity generally
show broad agreement in which parameters are considered important. However, there
are a few instances where the two are noticeably different. The differences are attributable to
the fact that the standard deviation is an inherently global measure of sensitivity whereas the
point derivative is an inherently local measure. We believe that the standard deviation may
be a more reliable sensitivity metric because it takes into account the behavior of the current
over a range of parameter values rather than simply at a single point. Second, the current
only exhibits sensitivity with respect to a relatively small number of parameters. A response
surface constructed using only the 15 most sensitive parameters should be of similar accuracy
to the response surface constructed using the full 127-parameter space.

We use the 15 most sensitive reaction parameters to perform a multi-dimensional collocation
study. We perform a level-6 collocation method on the 15-dimensional parameter space.
In order to simplify the problem further we use the scaled standard deviations of the current
in each of these parameters as weights for an anisotropic sparse grid collocation method, as in
[11]. This method requires the discretization and solution of (2.1) at 371 points in the reduced
parameter space. This computation was performed on Sandia’s Red Sky supercomputer and
took approximately 41 hours using 64 cores. The results of this multi-dimensional collocation
study were post-processed to examine the Sobol’ sensitivity indices and the probability
distributions of the current at various time steps.

Figure 5.3 shows the cumulative distribution function of the current at a series of time
steps. At time $t = 1 \times 10^{-4}$ the solution is nearly in a steady state defined by the initial boundary
value because the radiation pulse has not yet generated many defects and, as expected, the
CDF of the current is nearly a Heaviside function. Immediately after the pulse has ended, at
$t = 1 \times 10^{-3}$, the current exhibits a large degree of variability owing to the large number of
Frenkel pairs generated by the radiation pulse. The solution then quickly settles down into a
new steady state at time $t = 1 \times 10^{-2}$. The current at this steady state displays a relatively
small amount of variation over the parameter space. The overall uncertainty in the device
current varies as a function of time. This indicates that the total number of collocation points
as well as the anisotropic weights should be allowed to vary as a function of time to optimally
capture the behavior of the solution. Such a technique would require more intrusive modification
of the existing code as the collocation method would need to be executed as part of the
time series integrator.

Fig. 5.2. Sensitivity Metrics at $t = 1 \times 10^{-4}$ (top), $t = 1 \times 10^{-3}$ (middle), and $t = 1 \times 10^{-2}$ (bottom)


Fig.  5.3.  CDF  of  Current  at  various  time  steps 

Figure 5.4 shows the ratios $r_i = (TS_i - S_i)/TS_i$ computed from the 15-parameter study. In
performing such a sensitivity analysis, it may be the case that parameter interactions account for
only a small portion of the sensitivity in the current. In this case we would expect $r_i$ to be
close to zero for all of the parameters. This would indicate that most of the information
contained in the response surface could be derived from the behavior of the current along each
parameter axis. Figure 5.4 indicates that this is not the case. The figure shows that many
of the parameters are coupled to produce changes in the current. Therefore we see that the
OAT study does not contain enough information to reconstruct the response surface since it
does not contain information regarding what is occurring in the corners of the parameter space.
The fact that OAT studies do not explore the behavior of a function in the corners of the
parameter domain is considered a major weakness. The data contained in the Sobol’ indices
can be used to isolate correlated parameters that should not be considered separately in OAT
studies.

6. Conclusions. The goal of this paper was to determine what types of information
could be obtained from applying modern uncertainty quantification techniques to large scale
engineering problems. A great deal of information can be obtained from uncertainty
quantification techniques that approximate the response functions on the entire parameter space as
opposed to a single point. In particular, the standard deviation and Sobol’ sensitivity indices
provide insight into the global sensitivity of the response function with respect to the
parameters. Also, collocation methods can be used to obtain approximations of the density function
associated with the solution process, which can be valuable as part of a reliability analysis. In
the future we believe that it may be possible to use the results of a Sobol’ sensitivity analysis
to uncover hidden parameters, which may lead to further reduced models and additional
insights into the physics of the problem.

There are a number of additional areas which are natural extensions of this work. One
possibility is the implementation of a time adaptive sparse grid collocation method to
efficiently compute the response function at intermediate time steps. Another interesting
possibility is to use the reduced model within an inverse formulation to assess a tolerable range of
uncertainty within individual parameters. Understanding which of the problem’s uncertainties
are most important may help guide future experimentation. The possible dependence of
the uncertain quantities also warrants further study.

Fig. 5.4. Ratios $r_i$ computed for the 15 most sensitive parameters at various time steps.

KRYLOV RECYCLING FOR CLIMATE MODELING AND UNCERTAINTY QUANTIFICATION

KAPIL AHUJA*, MICHAEL L. PARKS†, ERIC T. PHIPPS‡, ANDREW G. SALINGER§, AND ERIC DE STURLER¶

Abstract. Krylov subspace recycling is a technique to accelerate the convergence of sequences of slowly changing
linear systems. Ice sheet modeling and embedded uncertainty quantification are two application areas where such
systems arise. Typically, recycling algorithms assume that the number of vectors selected for recycling is less than
the total number of iterations required to solve a system. Hence, these algorithms are useful when each system in the
sequence requires a relatively large number of iterations to converge. For the application areas under study, the number
of iterations required for convergence of a system is small. Hence, our existing family of recycling algorithms
is not suitable in its current form. However, this is not a problem inherent to recycling. We modify GCRO-DR
such that the recycle space can be built even for rapidly converging linear systems. Hence, the number of vectors
selected for recycling is no longer constrained by the total number of iterations required to solve a system.

GCRO-DR uses an approximate invariant subspace as the recycle space. This choice is not always the best. We
show in this paper that another choice works better for the application areas under consideration. We use Arnoldi
vectors that have large components in the right hand side as the recycle space. In addition, since our systems
converge rapidly, we modify GCRO-DR to avoid an extra orthogonalization step, leading to reduced cost. Numerical
experiments show the benefit of using modified GCRO-DR for embedded uncertainty quantification. We show
savings in iteration count as well as time. We also modify the recycling conjugate gradients (RCG) algorithm in a
similar way. The details for RCG are not described here.

We also carry out performance studies on standard GCRO-DR and standard RCG for the above-mentioned application
areas. The results show that recycling does not impose substantial overhead. The experiments are done using a
combination of recycling solvers implemented in the Belos package of the Trilinos project and Matlab.


1. Introduction. Recycling algorithms are appropriate for solving sequences of linear
systems

$$A^{(i)} x^{(i)} = b^{(i)}, \qquad (1.1)$$

where $A^{(i)} \in \mathbb{C}^{n \times n}$ and $b^{(i)} \in \mathbb{C}^{n}$ vary with i, and the matrices $A^{(i)}$ are large and sparse. For
any particular system in the sequence, recycling algorithms store a recycle space associated
with that system, and use this subspace to accelerate convergence for the next system in
the sequence. This idea originates from ‘thick restarting’ used for solving a single linear
system in the GCROT [2] and the GMRES-DR [12] algorithms. For solving a sequence of
linear systems, ‘recycling’ was first proposed in [13] where it is applied to the GCROT and
the GCRO-DR algorithms. The idea is further adapted in the recycling minimal residual
algorithm (RMINRES [19]), the recycling conjugate gradients algorithm (RCG [14]), and the
recycling bi-conjugate gradients (RBiCG [1]) algorithm.

Such sequences of linear systems arise in many application areas. We focus on two such
areas: climate modeling [16] and embedded uncertainty quantification [11, 10, 9]. The linear
systems arising in these application areas have ill-conditioned matrices. Thus, expensive
preconditioners are required for the Krylov methods to converge in a reasonable number of
iterations. Hence, it is worthwhile to look for methods that can help reduce the iteration
count.

The matrices arising in climate modeling are nonsymmetric and not positive real, hence
we use GCRO-DR, a recycling variant of GMRES [18]. The embedded uncertainty quantification
application consists of a stochastic, elliptic, partial differential equation (PDE) [11],
and ideally should lead to matrices that are symmetric positive definite (SPD). However,
the current implementation of boundary conditions in the code based on [11] destroys the
symmetry in the matrices. The system still remains B-normal(1) and hence the CG method
works [6, 3]. Therefore, we use both RCG (which is a recycling variant of CG) and GCRO-DR
for this application. A GMRES based recycling solver for stochastic elliptic PDEs is used
in [10] as well, where the authors use a recycling FGMRES [17].

Recycling algorithms typically have two input parameters that control how the recycle
space is built. The first parameter controls the frequency of updates of the recycle space.
Ideally, the recycle space should be updated at the end of the solve, but that would be expensive
if the system does not converge quickly (see [13] for details). The iteration process between
two updates of the recycle space is referred to as a ‘cycle’. The length of the cycle, m, refers
to the number of iterations between updates of the recycle space. The second parameter,
usually denoted by k, controls the size of the recycle space. Typically, recycling algorithms
are designed such that k is less than the expected number of iterations required for convergence
of the average system (denoted by niter). This is a reasonable assumption because
recycling algorithms target applications where niter is large. For the two application areas
under consideration, the linear systems converge quickly, due to the application of effective
but also expensive preconditioners. Hence, k must be very small. Because of this, the recycle
space does not contain sufficient information to accelerate convergence, and recycling does
not show much benefit. We modify the GCRO-DR algorithm such that k can be set greater
than niter. We show in Section 4 that this modification leads to faster convergence.

In many cases, deflation of eigenvalues close to the origin improves the convergence
rate [12]. Hence, current recycling algorithms use approximate invariant subspaces corresponding
to small eigenvalues (in magnitude) as the recycle space (eigenvalues can be deflated
by including the corresponding eigenvectors in the Krylov subspace). However, this
approach does not work well for the applications we consider. Therefore, we propose an
alternative mechanism for the recycle subspace selection. We build our recycle space from
Arnoldi vectors that have a large component in the right hand side. We modify the GCRO-DR
algorithm with this new recycle subspace selection approach. The results show that this
strategy improves convergence. We also optimize the GCRO-DR code by exploiting the fact that
each system in our sequence of systems converges quickly. The main optimization involves
avoiding an extra orthogonalization step. This reduces the runtimes for our algorithm. We also
apply the above three modifications ($k \ge niter$, different recycle subspace selection criteria, and
code optimizations) to the RCG algorithm. As the modifications are similar we do not
discuss them here.

To  show  that  recycling  does  not  impose  substantial  overhead,  we  do  performance  studies 
on  the  standard  GCRO-DR  algorithm  and  the  standard  RCG  algorithm. 

The rest of the paper is organized as follows. In Section 2, we give a brief description of
the two application areas. We describe the standard GCRO-DR algorithm and the three modifications
($k \ge niter$, different recycle subspace selection criteria, and code optimizations) in
Section 3. In Section 4, we give results from solving the embedded uncertainty quantification
application by modified GCRO-DR. We also show performance results for standard GCRO-DR
for climate modeling and standard RCG for embedded uncertainty quantification. For
the experiments, we use a combination of recycling solvers implemented in the Belos package
of the Trilinos project [5] and Matlab. The conclusions and future work are discussed in
Section 5.

2. Application Areas. Climate modeling consists of the following four components: an
atmospheric model, an ice model, a land model, and an ocean model. We focus on ice sheet
modeling using Glimmer, a community ice sheet model [16]. The ice sheet thickness (H) is
described by the following continuity equation [4, 15]:

$$\frac{\partial H}{\partial t} = B - \nabla \cdot (\mathbf{u} H), \qquad (2.1)$$

where B is the surface mass balance, $\mathbf{u}$ is the vertically averaged velocity of the ice sheet,
and $\nabla$ is the (horizontal) gradient operator. To model large ice sheets, the shallow ice
approximation is commonly used. This approximation assumes that the bedrock and the ice
surface have small slopes. Hence, the normal stress components are neglected [4, 7]. Finite
difference discretization of the resulting equation, on a staggered grid with periodic boundary
conditions, leads to a sequence of linear systems.

Embedded uncertainty quantification involves the following stochastic linear elliptic
PDE [11]:

$$\nabla \cdot \left( a(\mathbf{x}, \xi)\, \nabla u(\mathbf{x}, \xi) \right) = f \qquad (2.2)$$

in the domain $[-0.5, 0.5]^2$ with zero Dirichlet boundary conditions. In (2.2), $\xi$ is the input
random variable. The diffusivity is randomized and is given by


a(x,£)  =  1  +  <x— £  cos 

7ll 


(2.3) 


where $\sigma$ is the standard deviation of the random field a. The forcing function f is chosen by
substituting the following exact solution into (2.2):

$$u(\mathbf{x}, \xi) = \exp\left( -|\xi|^2 \right)\, 16\, (x^2 - 0.25)(y^2 - 0.25). \qquad (2.4)$$

Solving this stochastic PDE (2.2) by a sparse grid collocation method leads to a set of
uncoupled, deterministic PDEs. We use a finite difference discretization on a 5-point stencil to
obtain a sequence of linear systems (for details see [11]).

3.  Standard  and  Modified  GCRO-DR.  In  Sections  3.1  and  3.2,  we  briefly  describe 
the  theory  behind  the  standard  GCRO-DR  algorithm  and  its  modified  version,  respectively. 
The  pseudo-code  for  our  modified  GCRO-DR  is  given  in  the  appendix.  The  changes  from 
standard  GCRO-DR  are  marked  in  blue.  For  the  pseudo-code  of  the  standard  GCRO-DR 
algorithm,  see  the  appendix  in  [13]. 

3.1. Standard GCRO-DR. After solving the ith system in (1.1), GCRO-DR retains k
approximate eigenvectors of $A^{(i)}$, which are used to compute the matrices $U, C \in \mathbb{C}^{n \times k}$,
such that range(U) is an approximate invariant subspace of $A^{(i)}$ (and hopefully also of $A^{(i+1)}$),
$A^{(i+1)} U = C$, and $C^H C = I$.

GCRO-DR uses a modified Arnoldi process to compute the orthogonal basis for the
Krylov subspace such that each new Krylov vector is also orthogonal to range(C). This
produces the Arnoldi relation

$$(I - C C^H) A^{(i+1)} V_m = V_{m+1} \underline{H}_m, \qquad (3.1)$$

where $\underline{H}_m$ is an (m + 1) x m upper Hessenberg matrix. GCRO-DR finds the optimal solution
over the (direct) sum of the recycle space, range(U), and the new search space, range($V_m$).
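
The projected Arnoldi process behind (3.1) can be sketched in plain Python. This is an illustration only, not the Belos or Matlab implementation used in the paper: it assumes real arithmetic (so $C^H$ becomes a transpose), stores matrices as lists of vectors, uses a hypothetical 6 x 6 test matrix, and omits breakdown handling.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def projected_arnoldi(A, C, r, m):
    """Arnoldi on (I - C C^T) A started from r; C is a list of orthonormal vectors.

    Returns V (m + 1 basis vectors) and the (m + 1) x m Hessenberg matrix H
    such that (I - C C^T) A V_m = V_{m+1} H.
    """
    for c in C:                           # remove components of r along range(C)
        coeff = dot(c, r)
        r = [ri - coeff * ci for ri, ci in zip(r, c)]
    beta = math.sqrt(dot(r, r))
    V = [[ri / beta for ri in r]]
    H = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(A, V[j])
        for c in C:                       # apply the projector (I - C C^T)
            coeff = dot(c, w)
            w = [wi - coeff * ci for wi, ci in zip(w, c)]
        for i in range(j + 1):            # modified Gram-Schmidt against V
            H[i][j] = dot(V[i], w)
            w = [wi - H[i][j] * vi for wi, vi in zip(w, V[i])]
        H[j + 1][j] = math.sqrt(dot(w, w))
        V.append([wi / H[j + 1][j] for wi in w])
    return V, H

# Hypothetical nonsymmetric tridiagonal test matrix and a one-dimensional range(C).
n, m = 6, 3
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = i + 2.0
    if i + 1 < n:
        A[i][i + 1] = 1.0
        A[i + 1][i] = -1.0
C = [[1.0] + [0.0] * (n - 1)]
V, H = projected_arnoldi(A, C, [1.0] * n, m)
```

Every vector in V is orthogonal to range(C), which is what allows the solver to treat the recycle space and the new Krylov space as a direct sum.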

3.2. Modifications to Standard GCRO-DR. We discuss the three modifications to the
standard GCRO-DR algorithm that make it attractive for sequences with rapidly converging
linear systems. The first modification adapts the algorithm for the case where the number of
vectors selected for recycling can be set greater than (or equal to) the number of iterations
required for convergence ($k \ge niter$). One assumption that we maintained from the standard
GCRO-DR is that the recycled subspace size be less than the length of a cycle, or

$$k < m. \qquad (3.2)$$

Previously the following combinations were possible:

$$k < m < niter \quad \text{and} \quad k < niter < m. \qquad (3.3)$$

Now, the following is also possible:

$$niter \le k < m. \qquad (3.4)$$

Fig. 3.1. Eigenvalue Distribution (plot title: Eigenvalues of preconditioned operator for i = 40 and n = 256).
The second modification involves a different recycle subspace selection criterion. As for
any deflation based recycling algorithm, GCRO-DR uses an approximate invariant subspace
(corresponding to the eigenvalues close to the origin) to build the recycle space. It turns out
that for our applications, the spectrum of the preconditioned operator has clustered eigenvalues
that are well-separated from the origin (see Figure 3.1). Thus, this criterion does not work
well. Hence, we focus on the dominant components of the right-hand side to build the recycle
space. A similar approach has been used in [8] (also see [2]). Let range(U) be the recycle
space available from the previous linear system, and range($V_m$) be the new search space from
(3.1). We first build the matrix¹

$$\widehat{V}_p = [\, U \ \ V_p \,], \qquad (3.5)$$

and then pick those k vectors of $\widehat{V}_p$ that have a large component in the right hand side. The
GCRO-DR implementation is modified such that the user can specify the recycle subspace
selection criterion via the input parameter crit. Currently, the two options for it are eigen,
for the approximate eigenvector based criterion, and arnoldi, for this new component based
criterion.
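
One plausible reading of the component based criterion is sketched below. The text does not state the exact measure, so ranking the candidate vectors by the magnitude of their inner product with the right hand side is an assumption of this sketch, as are the function name and the use of real arithmetic.

```python
def select_recycle_vectors(columns, b, k):
    """Return the k candidate vectors with the largest |v^T b| (real arithmetic)."""
    def component(v):
        return abs(sum(vi * bi for vi, bi in zip(v, b)))
    ranked = sorted(columns, key=component, reverse=True)
    return ranked[:k]

# Hypothetical candidates: three unit vectors and a right hand side b.
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.1, -5.0, 2.0]
print(select_recycle_vectors(candidates, b, k=2))
```

In the setting of (3.5) the candidates would be the columns of the combined matrix built from U and the Arnoldi vectors, so vectors recycled from earlier systems compete with newly generated ones on equal terms.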

The third modification avoids the last orthogonalization in the algorithm². When $k \ge niter$,
the algorithm performs just one cycle. This implies that the recycle space generated
in the current system, $A^{(i)} x = b^{(i)}$, is not used as a search space for this same system; rather,
it is used as the search space for the next system in the sequence, $A^{(i+1)} x = b^{(i+1)}$. Therefore,
the computation for enforcing

$$C^H C = I \quad \text{with} \quad C = A^{(i)} U \qquad (3.6)$$

can be avoided. This does not affect the execution of the subsequent solves, because, at the
start of solving the (i + 1)th linear system, the following is always enforced:

$$C^H C = I \quad \text{with} \quad C = A^{(i+1)} U. \qquad (3.7)$$

For details, see the pseudo-code in the appendix (lines 45-51 and 54-58).

¹ This excludes the first system in the sequence, which requires special handling.
² Again, we have not done this for the first system in the sequence.
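
The condition in (3.6) and (3.7) can be enforced by orthonormalizing the columns of A U and applying the same column operations to U, which preserves the relation C = A U. The sketch below uses modified Gram-Schmidt in real arithmetic; it illustrates the idea only and is not the appendix pseudo-code, and the small test matrices are hypothetical.

```python
import math

def orthonormalize_recycle_space(A, U):
    """Return (U, C) with C = A U and C^T C = I; U is given as a list of column vectors.

    Modified Gram-Schmidt is applied to the columns of A U, and every column
    operation is mirrored on U so that the relation C = A U is preserved.
    """
    U = [list(u) for u in U]
    C = [[sum(a * x for a, x in zip(row, u)) for row in A] for u in U]
    for j in range(len(C)):
        for i in range(j):
            coeff = sum(a * b for a, b in zip(C[i], C[j]))
            C[j] = [cj - coeff * ci for cj, ci in zip(C[j], C[i])]
            U[j] = [uj - coeff * ui for uj, ui in zip(U[j], U[i])]
        norm = math.sqrt(sum(c * c for c in C[j]))
        C[j] = [c / norm for c in C[j]]
        U[j] = [u / norm for u in U[j]]
    return U, C

# Hypothetical nonsingular matrix and a two-dimensional recycle space.
A_next = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]]
U_old = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
U_new, C_new = orthonormalize_recycle_space(A_next, U_old)
```

Since this step is repeated with the new matrix at the start of each solve, performing it again with the old matrix at the end of a single-cycle solve would be redundant, which is the saving the third modification exploits.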

4. Numerical Results. We give results from solving the embedded uncertainty quantification
application by modified GCRO-DR in Section 4.1. The performance studies of
standard GCRO-DR for climate modeling and standard RCG for embedded uncertainty quantification
are described in Section 4.2. For experiments in both subsections, the relative convergence
tolerance was set to $10^{-6}$.

4.1. Modified GCRO-DR for Embedded Uncertainty Quantification. We used the Trilinos software package (version 10.2) to generate a sequence of preconditioned linear systems from the embedded uncertainty quantification application (example twoD_diffusion_collocation_example.cpp in the Stokhos package). The application generated a sequence of matrices, right hand sides, and multilevel preconditioners. The rest of the experimentation was done in Matlab. We did experiments on two system sizes, $n = 256$ and $n = 65536$. We compared the iteration count and the runtime of our modified GCRO-DR with the iteration count and runtime of Matlab's implementation of GMRES. The results are given in Figure 4.1. We used $m = 11$ and $k = 10$. The timing data are given in Table 4.1. As evident from the figure and the table, the saving in the iteration count is around 50%, and the saving in the runtime is close to 13%.


Table 4.1
Timing Comparison

n        GMRES Time    Modified GCRODR(11,10) Time    Saving in Time
256      2.51          2.19                           12.92%
65536    238.46        208.34                         12.63%

4.2. Performance Study. The experiments in this subsection were done using the Trilinos and the Glimmer software packages. We studied the performance of standard GCRO-DR with climate modeling as the target application. An ILU-type preconditioner was used. The code was profiled to measure the time for recycling, the time for the recycling orthogonalization step (i.e., application of $(I - CC^H)$ in (3.1)), and the time for application of the operator and the preconditioner. The recycling time includes all computations involving use of the recycle space (using $U$ and $C$), and all computations done to generate the recycle space (generating $U$ and $C$). Hence, the time for the recycling orthogonalization step is included in the time for recycling. The total time to solve a linear system consists of the time for recycling, the time for application of the operator and the preconditioner, the time for initializations (not reported here), and the time for the other steps of the standard GMRES algorithm, e.g., the normal orthogonalization time for the Arnoldi basis (not reported here). This was done for several values of $m$ and $k$. The results for $n = 7904$, $m = 8$, and $k = 4$ are given in Figure 4.2 (a). It is clear from the figure that the recycling time is modest compared with the time required to apply the operator and the preconditioner.

Fig. 4.1. Iteration and Timing Plots. Panels: iterations and time for GCRO-DR with and without recycling, for n = 256 and for n = 65536.

For the performance study of RCG, we used the embedded uncertainty quantification application³. As in Section 4.1, a multilevel preconditioner was used. The code was profiled to measure the time for recycling, the time for application of the operator, and the time for multiplying by the preconditioner. As for GCRO-DR, the recycling time includes all computations involving use of the recycle space and generation of the recycle space. The total time to solve a linear system is the sum of the time for recycling, the time for application of the operator, the time for multiplying by the preconditioner, the time for initializations (not reported here), and the time for other steps of the standard CG algorithm, e.g., computing the standard iteration vectors $x$, $r$, and $p$ (not reported here). Again, the experiments were done for several values of $m$ and $k$. The results for $n = 65536$, $m = 8$, and $k = 4$ are given in Figure 4.2 (b). As for GCRO-DR, the recycling time in RCG is modest compared with the time to apply the operator and the preconditioner.

5. Conclusions and Future Work. In this paper, we discuss modifications to the existing GCRO-DR algorithm for rapidly converging sequences of linear systems. The number of vectors selected for recycling can now be set greater than the total number of iterations required for convergence of a linear system. We present a new criterion for selecting the recycle subspace, and also incorporate a few code optimizations. We show that this modified GCRO-DR solves the embedded uncertainty quantification application faster than the standard GMRES algorithm. Similar modifications were made to the existing RCG algorithm but are not described here. We also do performance studies on standard GCRO-DR and standard RCG to show that recycling does not impose a substantial overhead.

³ The original application used AztecOO for linear solvers. We switched it to use Belos. This also demonstrated the use of the ML preconditioner with Belos. We also fixed a memory leak encountered in the original application.

Fig. 4.2. Performance Study Results.

Currently, we are fixing the boundary condition implementation for the embedded uncertainty quantification application. This will lead to SPD systems. Since CG is the optimal method for SPD systems, we can then use modified RCG. We plan to modify the implementation of GCRO-DR and RCG in Trilinos, and use them to generate new numerical results.

Future work involves two tasks. Firstly, we plan to profile the GCRO-DR and RCG codes for memory usage using Tau. Secondly, we plan to implement a grouping strategy for the sequence of linear systems arising in the embedded uncertainty quantification application. This strategy has been shown to accelerate the recycling algorithms in [10].

Appendix

A. Pseudo-Code for the Modified GCRO-DR Algorithm.

 1: Choose $m$, the length of the cycle, $k$, the desired number of recycle vectors, and crit, the recycle space selection criterion (eigen or arnoldi). Let tol be the convergence tolerance. Choose an initial guess $x_0$. Compute $r_0 = b - Ax_0$, and set $i = 1$.
 2: if $Y_s$ is defined (from a previous system; $s$ is the number of columns of $Y_s$) then
 3:     Call buildUC.
 4:     $x_1 = x_0 + U_s C_s^H r_0$
 5:     $r_1 = r_0 - C_s C_s^H r_0$
 6: else
 7:     $v_1 = r_0/\|r_0\|_2$
 8:     $c = \|r_0\|_2 e_1$
 9:     Perform $m$ steps of GMRES, solving $\min \|c - \bar{H}_m y\|_2$ for $y$. Let GMRES converge in $p$ steps, generating $V_{p+1}$ and $\bar{H}_p$.
10:     $x_1 = x_0 + V_p y$
11:     $r_1 = V_{p+1}(c - \bar{H}_p y)$
12:     if $k > p$ then
13:         $s = p$
14:         $Y_s =$ [...]
15:         Call buildUC.
16:     else
17:         $s = k$
18:         Compute the $k$ eigenvectors $z_j$ of $(H_p + h_{p+1,p}^2 H_p^{-H} e_p e_p^H) z_j = \theta_j z_j$ associated with the smallest magnitude eigenvalues $\theta_j$ and store in $P_k$.
19:         $Y_s = V_p P_k$
20:         Let $[Q, R]$ be the reduced QR-factorization of $\bar{H}_p P_k$.
21:         $C_s = V_{p+1} Q$
22:         $U_s = Y_s R^{-1}$
23:     end if
24: end if
25: while $\|r_i\|_2 >$ tol do
26:     $i = i + 1$
27:     Perform $m$ Arnoldi steps with the linear operator $(I - C_s C_s^H)A$, letting $v_1 = r_{i-1}/\|r_{i-1}\|_2$. Again, let the algorithm converge in $p$ steps, generating $V_{p+1}$, $\bar{H}_p$, and $B_p$.
28:     Let $D_s$ be a diagonal scaling matrix such that $\tilde{U}_s = U_s D_s$, where the columns of $\tilde{U}_s$ have unit norm.
29:     $t = s + p$
30:     $\hat{V}_t = [\,\tilde{U}_s \;\; V_p\,]$
31:     $\hat{W}_{t+1} = [\,C_s \;\; V_{p+1}\,]$
32:     $\bar{G}_t = \begin{bmatrix} D_s & B_p \\ 0 & \bar{H}_p \end{bmatrix}$
33:     Solve $\min \|\hat{W}_{t+1}^H r_{i-1} - \bar{G}_t y\|_2$ for $y$.
34:     $x_i = x_{i-1} + \hat{V}_t y$
35:     $r_i = r_{i-1} - \hat{W}_{t+1} \bar{G}_t y$
36:     if $k > t$ then
37:         [...]
38:         [...]
39:         Call buildUC.
40:     else
41:         $s = k$
42:         if crit == eigen then
43:             Compute the $k$ eigenvectors $\tilde{z}_i$ of $\bar{G}_t^H \bar{G}_t \tilde{z}_i = \tilde{\theta}_i \bar{G}_t^H \hat{W}_{t+1}^H \hat{V}_t \tilde{z}_i$ associated with the smallest magnitude eigenvalues $\tilde{\theta}_i$ and store in $P_k$.
44:             $Y_s = \hat{V}_t P_k$
45:             if $p == m$ then
46:                 Let $[Q, R]$ be the reduced QR-factorization of $\bar{G}_t P_k$.
47:                 $C_s = \hat{W}_{t+1} Q$
48:                 $U_s = Y_s R^{-1}$
49:             else
50:                 $U_s = Y_s$
51:             end if
52:         else
53:             Pick those $k$ columns of $\hat{V}_t$ that have large components in the right hand side and store in $Y_s$.
54:             if $p == m$ then
55:                 Call buildUC.
56:             else
57:                 $U_s = Y_s$
58:             end if
59:         end if
60:     end if
61: end while
62: Let $Y_s = U_s$ (for the next system)

buildUC

 1: Let $[Q, R]$ be the reduced QR-factorization of $AY_s$.
 2: $C_s = Q$

STABILITY OF ORDINARY DIFFERENTIAL EQUATIONS WITH COLORED NOISE FORCING

TIMOTHY  J.  BLASS§  AND  LOUIS  A.  ROMERO 

Abstract. We present a method for determining the stability of a class of stochastically forced ordinary differential equations, where the forcing term can be of quite general form. We use the Fokker-Planck equation to write a low-order partial differential equation for the second moments, which we turn into an eigenvalue problem for a second-order differential operator. The eigenvalues of this operator determine the stability of the system. Inspired by Dirac's creation and annihilation operator method, we develop "ladder" operators to determine analytic expressions for the eigenvalues and eigenfunctions of the operator.

1. Introduction. The original goal of this work was to develop a framework in which to analyze the stability of the stochastically forced Mathieu equation:

$$\ddot{x} + \gamma\dot{x} + (\omega_0^2 + \varepsilon f(t))x = 0, \tag{1.1}$$

where the stochastic process $f(t)$ is determined by

$$\begin{aligned}
\dot{s} &= -\kappa s + \sigma n(t),\\
\dot{u} &= -\alpha u + \beta s,\\
f(t) &= \alpha s(t) - \beta u(t).
\end{aligned} \tag{1.2}$$

Here $n(t)$ is "white noise", meaning that $dW_t = n(t)\,dt$, where $W_t$ is a Brownian motion. We refer to $f$ as a second-order filter because it is generated by a second-order ODE system. Equation (1.1) is a model for the dispersion relation of a capillary wave in a time-varying gravitational field, with $f(t)$ producing the random fluctuations in acceleration [5]. If $\alpha, \kappa > 0$, the stochastic process (1.2) is stationary as $t \to \infty$, with power spectral density

$$S(\omega) = \frac{\sigma^2\alpha^2\omega^2}{(\omega^2 + \kappa^2)(\omega^2 + \beta^2)}.$$

This is of interest for applications where the forcing term represents an acceleration, because acceleration is the second derivative of position, so its Fourier transform should have the form $S(\omega) = \omega^2 g(\omega)$ where $g$ is continuous at zero. An example profile for $S(\omega)$ is shown in Figure 1.1.

As explained below, the stability is determined by solving an eigenvalue problem for a differential operator derived from the stochastic differential equations. The stability of (1.1) was determined using numerical and perturbation techniques. Beginning the analysis with the simplified case of a first-order filter forcing a first-order equation was very useful. The first-order filter is obtained by eliminating the equation for $u$ in (1.2). This simpler system has the form

$$\begin{aligned}
\dot{s} &= -\kappa s + \sigma n(t),\\
\dot{x} &= (-\gamma + \varepsilon f(t))x,
\end{aligned} \tag{1.3}$$

with $f(t) = s(t)$. This form of $f$ no longer represents an acceleration, but the eigenvalues and eigenfunctions that determine the stability have analytic expressions. The eigenfunctions are given in terms of Hermite polynomials, and the eigenvalues are of the form $\{-n\kappa - 2\gamma + 2\varepsilon^2\sigma^2/\kappa^2\}_{n \ge 0}$. Using the Hermite polynomials as a basis was key to generalizing the analysis to more complicated systems, and was an important part of developing our numerical approach. This is explained in detail in Section 3.

Fig. 1.1. $S(\omega)$ vs $\omega$ for $\alpha = 0.3$, $\kappa = 0.5$, $\beta = 1$, $\sigma = 1$.

The next step towards understanding the stability of solutions to (1.3) and the stability of solutions to (1.1) is understanding the stability of solutions to a first-order equation that is forced with a second-order filter. This has the form

$$\begin{aligned}
\dot{s} &= -\kappa s + \sigma n(t),\\
\dot{u} &= -\alpha u + \beta s,\\
\dot{x} &= -\gamma x + \varepsilon f(t)x,\\
f(t) &= a s(t) + b u(t).
\end{aligned} \tag{1.4}$$

All parameters in (1.7) are real, and $\kappa, \alpha, \gamma > 0$ so, in the absence of noise, the solutions would be bounded. This is not the most general form of a second-order filter, but is a physically relevant form. Surprisingly, we observed numerically and confirmed with perturbation theory that the eigenvalues that determine the stability of (1.4) appeared in constant increments, and had an $\varepsilon^2$ dependence, just as in the case of system (1.2). This led us to search for analytic expressions for the eigenvalues and eigenfunctions, which we have obtained. In doing so, we found that the framework extends to an $n$-th order filter, similar to the form (1.7) but with $n$ equations generating $f$. That is, of the form

$$\begin{aligned}
\dot{\mathbf{s}} &= -B\mathbf{s} + \mathbf{n}(t),\\
\dot{x} &= -\gamma x + \varepsilon\, \mathbf{c}^T\mathbf{s}(t)\,x,
\end{aligned} \tag{1.5}$$

where $\mathbf{c}$ and $\mathbf{s}(t)$ are $n$-vectors, $B$ is an $n \times n$ matrix, and $\mathbf{n}(t)$ is an $n$-vector of white noises.

1.1. Stability. We determine the stability of (1.4) by the long-time behavior of the second moments of the solutions. The case of forcing by white noise has been studied extensively, but white noise is not realistic as a forcing term. In addition, if the system is linear and the forcing is by white noise, then the associated equations for the moments of the solutions are a simple system of ordinary differential equations (see [1]). The nonlinear term $f(t)x$ in our system precludes this type of approach to our study of stability. Van Kampen has presented a heuristic approach to the case of colored noise [6]. We improve on this by developing a rigorous theory for colored noise forcing. We consider the second moment in $x$ as a function of $s$ and $u$, as well as $t$:

$$M_{xx}(s, u, t) = \int_{\mathbb{R}} x^2 P(x, s, u, t)\,dx. \tag{1.6}$$

In the most general case of a second-order filter, that is,

$$\begin{aligned}
\dot{s} &= -\kappa s + \nu u + \sigma n_1(t),\\
\dot{u} &= -\alpha u + \beta s + \rho n_2(t),\\
\dot{x} &= -\gamma x + \varepsilon f(t)x,\\
f(t) &= a s(t) + b u(t),
\end{aligned} \tag{1.7}$$

$M_{xx}$ will satisfy a PDE in $s$, $u$, and $t$; in the case of a first-order filter, the equation will be only in $s$, $t$. The function $P(x, s, u, t)$ in (1.6) is the joint probability density function for (1.7), which is the solution to the associated Fokker-Planck equation, with $f(t) = a s(t) + b u(t)$,

$$\begin{aligned}
\partial_t P = {}& \frac{\sigma^2}{2}\partial_s^2 P + \frac{\rho^2}{2}\partial_u^2 P + \partial_s[(\kappa s - \nu u)P]\\
&+ \partial_u[(\alpha u - \beta s)P] + \partial_x[(\gamma - \varepsilon(as + bu))xP].
\end{aligned} \tag{1.8}$$

If we multiply this equation by $x^2$ and integrate with respect to $x$, we arrive at the equation

$$\begin{aligned}
\partial_t M_{xx} = {}& \frac{\sigma^2}{2}\partial_s^2 M_{xx} + \frac{\rho^2}{2}\partial_u^2 M_{xx} + \partial_s[(\kappa s - \nu u)M_{xx}]\\
&+ \partial_u[(\alpha u - \beta s)M_{xx}] + (-2\gamma + 2\varepsilon(as + bu))M_{xx}.
\end{aligned} \tag{1.9}$$
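The only term of (1.8) that changes form in passing to (1.9) is the last one. A short integration by parts, assuming that $x^3 P$ vanishes as $|x| \to \infty$ (an assumption not stated explicitly in the text), shows where the factor $-2\gamma + 2\varepsilon(as + bu)$ comes from:

```latex
\int_{\mathbb{R}} x^2\, \partial_x\!\big[(\gamma - \varepsilon(as+bu))\,xP\big]\,dx
  = -\int_{\mathbb{R}} 2x\,(\gamma - \varepsilon(as+bu))\,xP\,dx
  = \big(-2\gamma + 2\varepsilon(as+bu)\big)\,M_{xx}.
```

The remaining terms contain only derivatives in $s$ and $u$, which commute with multiplication by $x^2$ and with integration in $x$, so they carry over with $P$ replaced by $M_{xx}$.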

For notational convenience, we define $\mathcal{D}$ to be the operator giving the right-hand side of (1.9). That is, we can rewrite (1.9) as $\partial_t M_{xx} = \mathcal{D}M_{xx}$. If $\mathcal{D}$ has a complete set of eigenfunctions, then we can write a solution to (1.9) as a linear combination of the eigenfunctions multiplied by an exponential in $t$. Thus, we are led to look for $M_{xx}(s, u, t) = e^{\lambda t}M(s, u)$. Such a function $M$ would solve the eigenvalue problem

$$\lambda M = \mathcal{D}M. \tag{1.10}$$

If there is a solution to (1.10) with $\lambda > 0$, then the system (1.7) will be unstable. We find an analytic expression for the eigenvalues $\lambda$, and are thus able to predict the stability.

As explained above, two useful and illuminating examples are the cases (1.3) and (1.4). Another motivation for studying (1.3) is that the form of $f$, while no longer representing an acceleration, is a commonly used process (Ornstein-Uhlenbeck) in modeling, so it should be of interest in many applications. The result in this case, which is proved in Section 2.1, is

Theorem 1.1. The solutions to (1.3) are stable for $0 < \varepsilon < \kappa\sqrt{\gamma}/\sigma$.

For the second-order filter, we have the following theorem, as proved in Section 2.2.

Theorem 1.2. The solutions to (1.4) are stable for $0 < \varepsilon < \alpha\kappa\sqrt{\gamma}/(\sigma\,|\alpha^2 - \beta^2|)$.

The operator $\mathcal{D}$ shares a similar structure with the quantum harmonic oscillator, which Dirac solved using creation and annihilation operators (also called ladder operators), see [3]. Inspired by this work, we determine the eigenvalues and eigenfunctions of $\mathcal{D}$ via ladder operators. A ladder operator is any operator $X$ such that

$$[\mathcal{D}, X] = \mu X, \tag{1.11}$$

for $\mu \neq 0$. Here $[\cdot,\cdot]$ denotes the commutator. If $\mathcal{D}\phi = d\phi$ then $\mathcal{D}(X\phi) = X\mathcal{D}\phi + \mu X\phi = (d + \mu)X\phi$, so that $X\phi$ is also an eigenfunction of $\mathcal{D}$ with eigenvalue $d + \mu$, provided that $X\phi$ is not identically zero. In the second-order filter case, we find pairs of these operators, $X_1^{\pm}$, $X_2^{\pm}$, where $X_i^{\pm}$ increments by $\pm\mu_i$. Thus, the eigenvalues of (1.10) form a bi-infinite family, given by $\{d_0 + i\mu_1 + j\mu_2\}_{i,j \in \mathbb{Z}}$, for some $d_0 \in \mathbb{R}$. In the case of the first-order filter, one of the $\mu_i$ is zero, so that the family is indexed only by one copy of $\mathbb{Z}$. The details of constructing these operators are the subject matter of Section 2. In general, an $n$-th order system will have $n$ pairs of such operators, where if one increments by $\mu$ the other in the pair increments by $-\mu$.

The entire stability analysis for (1.7) can be done in terms of solutions to (1.11). If we define the operators $L_i = \partial_{s_i}$ for $0 < i \le n$, $L_i = s_{i-n}$ for $n + 1 \le i \le 2n$, and $L_{2n+1} = 1$, we can write each solution $X^k$ of (1.11) as $X^k = \sum_i x_i^k L_i$, and (1.11) becomes an eigenvalue problem of the form $SAx^k = \mu_k x^k$, where $S$ is symmetric and $A$ is antisymmetric. From this, we can show that $X_k^+$ commutes with all the other solutions to (1.11), except for its paired operator, $X_k^-$, where the commutator is a constant. This, in turn, allows us to write

$$\mathcal{D} = \sum_k \nu_k X_k^- X_k^+ + \lambda_0 \tag{1.12}$$

and to determine the base eigenvalue. That is, we can find $d_0$, which will be the largest eigenvalue of $\mathcal{D}$, thereby determining the stability of solutions to (1.7).

2. Ladder Operators. We are interested in constructing ladder operators for $\mathcal{D}$. That is, we seek operators $X$ satisfying the commutation relation

$$[\mathcal{D}, X] = \mu X. \tag{2.1}$$

To find such an $X$, we observe that the differential operators $\partial_s$, $\partial_u$, as well as multiplication by $s$ or $u$, all have nice commutation relations with $\mathcal{D}$:

$$\begin{aligned}
[\mathcal{D}, \partial_s] &= -\kappa\partial_s + \beta\partial_u - 2a\varepsilon,\\
[\mathcal{D}, s] &= \sigma^2\partial_s + \kappa s - \nu u,\\
[\mathcal{D}, \partial_u] &= -\alpha\partial_u + \nu\partial_s - 2b\varepsilon,\\
[\mathcal{D}, u] &= \rho^2\partial_u + \alpha u - \beta s.
\end{aligned} \tag{2.2}$$

Except for the constants that appear in the relations for $[\mathcal{D}, \partial_s]$ and $[\mathcal{D}, \partial_u]$, the commutator of $\mathcal{D}$ with each operator can be written in terms of the other operators. Hence, if we include constants (with $[\mathcal{D}, 1] = 0$), we can build $X$ as a linear combination of the operators. We write


$$X = c_1\partial_s + c_2 s + c_3\partial_u + c_4 u + c_5 \tag{2.3}$$

and require the $c_j$ to satisfy (2.1). This becomes a simple matrix eigenvalue problem $Tc = \mu c$; in detail,

$$\begin{pmatrix}
-\kappa & \sigma^2 & \nu & 0 & 0\\
0 & \kappa & 0 & -\beta & 0\\
\beta & 0 & -\alpha & \rho^2 & 0\\
0 & -\nu & 0 & \alpha & 0\\
-2a\varepsilon & 0 & -2b\varepsilon & 0 & 0
\end{pmatrix}
\begin{pmatrix} c_1\\ c_2\\ c_3\\ c_4\\ c_5 \end{pmatrix}
= \mu
\begin{pmatrix} c_1\\ c_2\\ c_3\\ c_4\\ c_5 \end{pmatrix}. \tag{2.4}$$

The eigenvalues of $T$ are

$$\mu \in \Big\{0,\ \pm\tfrac{1}{2}\Big(\alpha + \kappa \pm \sqrt{(\alpha - \kappa)^2 + 4\beta\nu}\Big)\Big\}. \tag{2.5}$$

2.1. First Order Filter. In the particular cases discussed in the introduction, these simplify greatly. In the first order case, we have $\alpha = \nu = \beta = \rho = b = 0$ and $a = 1$. This is the case of a first-order filter, and $\mathcal{D}$ has the form $\mathcal{D}\phi = \frac{\sigma^2}{2}\partial_s^2\phi + \kappa\partial_s(s\phi) + (-2\gamma + 2\varepsilon s)\phi$. The stability of the solutions to (1.7) is completely characterized by the following theorem.

Theorem 2.1. In the first-order filter case, the eigenvalues of $\mathcal{D}$ are given by

$$\lambda_n = -n\kappa - 2\gamma + \frac{2\varepsilon^2\sigma^2}{\kappa^2}, \quad n \ge 0.$$

The eigenfunctions of $\mathcal{D}$ are given by $\phi_n = (X^-)^n\phi_0(s)$, where

$$\phi_0(s) = e^{-\kappa s^2/\sigma^2 + 2\varepsilon s/\kappa} \quad\text{and}\quad X^- = \kappa\partial_s + 2\varepsilon.$$

Hence, the eigenfunctions are products of exponentials with the Hermite polynomials, which are complete in $L^2$. Note that the result of Theorem 1.1 follows immediately from the result of Theorem 2.1. If the eigenvalues are indeed given by $\lambda_n = -n\kappa - 2\gamma + 2\varepsilon^2\sigma^2/\kappa^2$, where $n \ge 0$, then the largest eigenvalue is $\lambda_0 = -2\gamma + 2\varepsilon^2\sigma^2/\kappa^2$, and the stability of the solutions to (1.7) is decided precisely by the sign of this eigenvalue. Hence, the stability barrier is $\lambda_0 = 0$, for which $\varepsilon = \kappa\sqrt{\gamma}/\sigma$, which is the condition in Theorem 1.1.
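As a quick numerical illustration of this condition, the sketch below evaluates the largest eigenvalue $\lambda_0 = -2\gamma + 2\varepsilon^2\sigma^2/\kappa^2$ from Theorem 2.1 and the value of $\varepsilon$ at which it changes sign. The function names and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def lambda_0(eps, gamma, kappa, sigma):
    """Largest eigenvalue from Theorem 2.1: -2*gamma + 2*eps^2*sigma^2/kappa^2."""
    return -2.0 * gamma + 2.0 * eps**2 * sigma**2 / kappa**2

def critical_eps(gamma, kappa, sigma):
    """Value of eps at which lambda_0 = 0 (the stability barrier)."""
    return kappa * math.sqrt(gamma) / sigma

gamma, kappa, sigma = 0.25, 0.5, 1.0       # illustrative values only
eps_c = critical_eps(gamma, kappa, sigma)  # 0.5 * sqrt(0.25) / 1.0

below = lambda_0(0.9 * eps_c, gamma, kappa, sigma)  # negative: second moment decays
above = lambda_0(1.1 * eps_c, gamma, kappa, sigma)  # positive: second moment grows
```

Note that the barrier scales linearly with $\kappa$: a faster-decorrelating filter tolerates a larger forcing amplitude.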

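Theorem 2.1 depends on the following three lemmas.

Lemma 2.2. If $\phi$ is an eigenfunction of $\mathcal{D}$ with eigenvalue $\lambda$, then $\psi = X^{\pm}\phi$ is either the zero function or is an eigenfunction of $\mathcal{D}$ with eigenvalue $\lambda \pm \kappa$, where the operators $X^{\pm}$ are defined by

$$X^+ = \kappa\partial_s + \frac{2\kappa^2}{\sigma^2}s - 2\varepsilon, \qquad X^- = \kappa\partial_s + 2\varepsilon.$$

Proof. For $\alpha = \nu = \beta = \rho = b = 0$ and $a = 1$, the nontrivial eigenvalues and their eigenvectors of (2.4) become

$$\mu = \kappa:\ c = \Big(\kappa,\ \frac{2\kappa^2}{\sigma^2},\ 0,\ 0,\ -2\varepsilon\Big)^T, \qquad \mu = -\kappa:\ c = (\kappa,\ 0,\ 0,\ 0,\ 2\varepsilon)^T.$$

This means that the differential operators $X^{\pm}$ satisfy $[\mathcal{D}, X^{\pm}] = \pm\kappa X^{\pm}$. Hence, $\mathcal{D}X^{\pm}\phi = X^{\pm}\mathcal{D}\phi \pm \kappa X^{\pm}\phi = (\lambda \pm \kappa)X^{\pm}\phi$. Hence, if $X^{\pm}\phi$ is not identically zero, then it is a new eigenfunction for $\mathcal{D}$. □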
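The eigenvalue problem (2.4) that this proof specializes can also be checked numerically in the general second-order setting. The sketch below assembles $T$ from (2.4) for arbitrary illustrative parameter values and confirms that $\det(T - \mu I)$ vanishes, up to rounding, at each of the five values in (2.5). The helper names are hypothetical, and the determinant routine is plain Gaussian elimination, not anything used by the authors.

```python
import math

def build_T(kappa, alpha, beta, nu, sigma, rho, a, b, eps):
    """Matrix T of (2.4) acting on the coefficients (c1, ..., c5) of (2.3)."""
    return [
        [-kappa,         sigma**2, nu,             0.0,    0.0],
        [0.0,            kappa,    0.0,            -beta,  0.0],
        [beta,           0.0,      -alpha,         rho**2, 0.0],
        [0.0,            -nu,      0.0,            alpha,  0.0],
        [-2.0 * a * eps, 0.0,      -2.0 * b * eps, 0.0,    0.0],
    ]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    n, d = len(M), 1.0
    for j in range(n):
        p = max(range(j, n), key=lambda r: abs(M[r][j]))
        if M[p][j] == 0.0:
            return 0.0
        if p != j:
            M[j], M[p] = M[p], M[j]
            d = -d
        d *= M[j][j]
        for r in range(j + 1, n):
            f = M[r][j] / M[j][j]
            for c in range(j, n):
                M[r][c] -= f * M[j][c]
    return d

def char_poly(T, mu):
    """det(T - mu*I)."""
    n = len(T)
    return det([[T[i][j] - (mu if i == j else 0.0) for j in range(n)]
                for i in range(n)])

def eigenvalues_2_5(kappa, alpha, beta, nu):
    """The five values listed in (2.5)."""
    root = math.sqrt((alpha - kappa)**2 + 4.0 * beta * nu)
    half = [0.5 * (alpha + kappa + root), 0.5 * (alpha + kappa - root)]
    return [0.0] + half + [-h for h in half]

# Illustrative parameter values only.
params = dict(kappa=0.5, alpha=0.3, beta=1.0, nu=0.2,
              sigma=1.0, rho=0.7, a=0.4, b=-0.6, eps=0.1)
T = build_T(**params)
residuals = [char_poly(T, mu) for mu in eigenvalues_2_5(0.5, 0.3, 1.0, 0.2)]
```

The eigenvalues depend only on $\kappa$, $\alpha$, $\beta$, $\nu$; the noise strengths and the coefficients $a$, $b$, $\varepsilon$ enter only through the eigenvectors, that is, through the form of the ladder operators.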

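Lemma 2.3. The eigenvalues of $\mathcal{D}$ are bounded above by $\lambda_0 = -2\gamma + 2\varepsilon^2\sigma^2/\kappa^2$.

Proof. This follows from two simple relationships between $X^+$ and $X^-$. Let $L^2(w)$ be the weighted Hilbert space of equivalence classes of measurable functions such that $\int_{\mathbb{R}} f(x)\overline{f(x)}\,w(x)\,dx < \infty$. Then it is a simple calculus exercise (integration by parts) to verify that, in $L^2(w)$ with $w(s) = e^{\kappa s^2/\sigma^2}$,

$$\int_{\mathbb{R}} (X^- f)\,\bar{g}\,w\,ds = -\int_{\mathbb{R}} f\,\overline{(X^+ g)}\,w\,ds. \tag{2.6}$$

It is also easy to see that

$$\frac{\sigma^2}{2\kappa^2}X^- X^+\phi = \mathcal{D}\phi + \Big(2\gamma - \frac{2\varepsilon^2\sigma^2}{\kappa^2}\Big)\phi. \tag{2.7}$$

Combining these facts, we see that if $\phi$ is an eigenfunction of $\mathcal{D}$ with eigenvalue $\lambda$, then

$$\begin{aligned}
\Big(\lambda + 2\gamma - \frac{2\varepsilon^2\sigma^2}{\kappa^2}\Big)\|\phi\|^2_{L^2(w)}
&= \int_{\mathbb{R}} \Big[(\mathcal{D}\phi) + \Big(2\gamma - \frac{2\varepsilon^2\sigma^2}{\kappa^2}\Big)\phi\Big]\bar{\phi}\,e^{\kappa s^2/\sigma^2}\,ds\\
&= \int_{\mathbb{R}} \frac{\sigma^2}{2\kappa^2}(X^- X^+\phi)\,\bar{\phi}\,e^{\kappa s^2/\sigma^2}\,ds\\
&= -\int_{\mathbb{R}} \frac{\sigma^2}{2\kappa^2}(X^+\phi)\,\overline{(X^+\phi)}\,e^{\kappa s^2/\sigma^2}\,ds\\
&= -\frac{\sigma^2}{2\kappa^2}\|X^+\phi\|^2_{L^2(w)} \le 0.
\end{aligned}$$

Hence, we must have $\lambda \le -2\gamma + 2\varepsilon^2\sigma^2/\kappa^2$. □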

Lemma 2.4. The function $\phi_0(s) = e^{-\kappa s^2/\sigma^2 + 2\varepsilon s/\kappa}$ solves $X^+\phi_0 = 0$, and is an eigenfunction of $\mathcal{D}$ with eigenvalue

$$\lambda_0 = -2\gamma + \frac{2\varepsilon^2\sigma^2}{\kappa^2}, \tag{2.8}$$

which is the largest eigenvalue of $\mathcal{D}$.

Proof. The previous two lemmas imply that we cannot generate new eigenfunctions of $\mathcal{D}$ indefinitely by applying $X^+$ to an eigenfunction. That is, at some point we must have $X^+\phi = 0$; otherwise we could endlessly generate new eigenfunctions with eigenvalues increasing to infinity, contradicting Lemma 2.3. It is easy to see that $\phi_0$ is the solution to $X^+\phi = 0$, and that $\mathcal{D}\phi_0 = \lambda_0\phi_0$. The other eigenfunctions of $\mathcal{D}$ can be written as $(X^-)^n\phi_0$ because not only are these eigenfunctions of $\mathcal{D}$, as shown in Lemma 2.2, but they generate the Hermite polynomials, which are complete. This completes the proof of Lemma 2.4 as well as the proof of Theorem 2.1. □
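The ladder construction of Lemmas 2.2 to 2.4 can be checked on a computer. Writing an eigenfunction as $p(s)\phi_0(s)$ with $p$ a polynomial, a direct calculation (done here for illustration, not taken from the text) gives $\mathcal{D}(p\phi_0) = [\tfrac{\sigma^2}{2}p'' + (2\varepsilon\sigma^2/\kappa - \kappa s)p' + \lambda_0 p]\phi_0$ and $X^-(p\phi_0) = [\kappa p' + (4\varepsilon - 2\kappa^2 s/\sigma^2)p]\phi_0$. The sketch below represents $p$ by its coefficient list and confirms that $n$ applications of $X^-$ give an eigenfunction with eigenvalue $\lambda_0 - n\kappa$. All function names and parameter values are assumptions.

```python
def deriv(p):
    """Derivative of a polynomial given as coefficients [p0, p1, ...]."""
    return [k * p[k] for k in range(1, len(p))] or [0.0]

def add(p, q):
    n = max(len(p), len(q))
    return [(p[k] if k < len(p) else 0.0) + (q[k] if k < len(q) else 0.0)
            for k in range(n)]

def scale(c, p):
    return [c * pk for pk in p]

def times_s(p):
    """Multiply a polynomial by s."""
    return [0.0] + list(p)

def apply_D(p, gamma, kappa, sigma, eps):
    """Action of D on p(s)*phi_0(s), expressed on the polynomial factor p."""
    lam0 = -2.0 * gamma + 2.0 * eps**2 * sigma**2 / kappa**2
    dp, ddp = deriv(p), deriv(deriv(p))
    out = scale(0.5 * sigma**2, ddp)
    out = add(out, scale(2.0 * eps * sigma**2 / kappa, dp))
    out = add(out, scale(-kappa, times_s(dp)))
    return add(out, scale(lam0, p))

def apply_X_minus(p, kappa, sigma, eps):
    """Action of the lowering operator X^- on p(s)*phi_0(s), on the factor p."""
    out = scale(kappa, deriv(p))
    out = add(out, scale(4.0 * eps, p))
    return add(out, scale(-2.0 * kappa**2 / sigma**2, times_s(p)))

gamma, kappa, sigma, eps = 0.25, 0.5, 1.0, 0.1   # illustrative values only
lam0 = -2.0 * gamma + 2.0 * eps**2 * sigma**2 / kappa**2

p = [1.0]            # polynomial factor of phi_0 itself
max_residual = 0.0
for n in range(6):
    # Check D(p*phi_0) = (lam0 - n*kappa) * p*phi_0 coefficient by coefficient.
    Dp = apply_D(p, gamma, kappa, sigma, eps)
    target = scale(lam0 - n * kappa, p)
    diff = add(Dp, scale(-1.0, target))
    max_residual = max(max_residual, max(abs(d) for d in diff))
    p = apply_X_minus(p, kappa, sigma, eps)
```

Each application of $X^-$ raises the degree of the polynomial factor by one, which is the sense in which the ladder generates the Hermite polynomials.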


2.2. Second Order Filter. For the second-order filter, with $\nu = \rho = 0$, $a = \alpha$, $b = -\beta$, the operator $\mathcal{D}$ is

$$\mathcal{D}\phi = \frac{\sigma^2}{2}\partial_s^2\phi + \kappa\partial_s(s\phi) + \partial_u((\alpha u - \beta s)\phi) + (-2\gamma + 2\varepsilon(\alpha s - \beta u))\phi.$$

This case is similar to the first order filter, though some details are more complicated. We will sketch the main ideas, many of which generalize to filters of arbitrary degree.

The result is the theorem:

Theorem 2.5. With $\nu = \rho = 0$, $a = \alpha$, $b = -\beta$, the eigenvalues of $\mathcal{D}$ are

$$\lambda_{n,j} = -n\alpha - j\kappa - 2\gamma + \frac{2(\alpha^2 - \beta^2)^2\sigma^2\varepsilon^2}{\alpha^2\kappa^2}, \quad n, j \ge 0.$$

The eigenfunctions of $\mathcal{D}$ are given by $\phi_{n,j}(s, u) = (X_\alpha^-)^n (X_\kappa^-)^j \phi_{0,0}(s, u)$, where $X_\alpha^- = \partial_u - \frac{2\beta}{\alpha}\varepsilon$, $X_\kappa^- = \partial_s + \frac{\beta}{\alpha - \kappa}\partial_u - \frac{2(\beta^2 + \alpha\kappa - \alpha^2)}{\kappa(\alpha - \kappa)}\varepsilon$, and

$$\phi_{0,0}(s, u) = \exp\Big(-\frac{\alpha + \kappa}{\sigma^2}s^2 - \frac{\alpha(\alpha + \kappa)^2}{\beta^2\sigma^2}u^2 + \frac{2\alpha(\alpha + \kappa)}{\beta\sigma^2}su - \frac{2(\alpha^2 - \beta^2)\varepsilon}{\alpha\kappa}s + \frac{2\varepsilon(2\alpha^3 - 2\alpha\beta^2 + 2\alpha^2\kappa - \beta^2\kappa)}{\alpha\beta\kappa}u\Big). \tag{2.9}$$

The form of $\phi_{0,0}$ and $\phi_{n,j}$ indicates that the Hermite polynomials will again be the building blocks of the eigenfunctions. However, in this case it is not as simple as exponentials times Hermite polynomials, because there is a mixing of the variables $s$ and $u$.

As mentioned above, this setting is considerably more complicated than the first-order filter. In particular, the analogue of Lemma 2.3 is not available directly. However, the practical techniques for finding the eigenvalues and eigenfunctions are the same. In fact, many of the interesting features of the previous section can be seen simply by studying the eigenvalue problem (2.4) in more detail. It is in this framework that we investigate the second-order case.
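To begin, by solving equation (2.4), we find the four operators:

$$\begin{aligned}
X_\alpha^+ &= \partial_s + \frac{\alpha + \kappa}{\sigma^2}s + \frac{\beta}{2\alpha}\partial_u + \frac{\kappa^2 - \alpha^2}{\beta\sigma^2}u + \frac{\beta^2 - 2\alpha^2}{\alpha^2}\varepsilon,\\
X_\alpha^- &= \partial_u - \frac{2\beta}{\alpha}\varepsilon,\\
X_\kappa^+ &= \partial_s + \frac{2\kappa}{\sigma^2}s + \frac{\beta}{\alpha + \kappa}\partial_u + \frac{2(\beta^2 - \alpha\kappa - \alpha^2)}{\kappa(\alpha + \kappa)}\varepsilon,\\
X_\kappa^- &= \partial_s + \frac{\beta}{\alpha - \kappa}\partial_u - \frac{2(\beta^2 + \alpha\kappa - \alpha^2)}{\kappa(\alpha - \kappa)}\varepsilon,
\end{aligned}$$

where each $X$ satisfies the relation $[\mathcal{D}, X] = cX$, where $c = \pm\alpha, \pm\kappa$. Hence, each $X$ generates a new eigenfunction from an old one, and increments the eigenvalue by $\pm\alpha$, $\pm\kappa$, provided the function is not identically zero. The reasoning is exactly the same as for Lemma 2.2. Furthermore, the largest eigenvalue is

$$\lambda_{0,0} = -2\gamma + \frac{2(\alpha^2 - \beta^2)^2\sigma^2\varepsilon^2}{\alpha^2\kappa^2}, \tag{2.10}$$

so we must have that at some point $X_\alpha^+\phi = 0$ and $X_\kappa^+\phi = 0$, for the same function $\phi$. Otherwise, by applying the raising operators repeatedly to an eigenfunction, we could generate an eigenfunction with eigenvalue exceeding $\lambda_{0,0}$. The expression (2.10) for $\lambda_{0,0}$ gives the range of $\varepsilon$ stated in Theorem 1.2, for which the equation is stable.

Once we have this, the analogous result for Lemma 2.4 is obtained by simultaneously solving the first order PDEs

$$X_\alpha^+\phi = 0, \qquad X_\kappa^+\phi = 0. \tag{2.11}$$

The result is the formula for $\phi_{0,0}$ given in Theorem 2.5. These results have been tested numerically, as described in Section 3.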

2.3. General Techniques. The main tool in all of this analysis is the ladder operators. These are written as a linear combination of the basic operators $s$, $u$, $\partial_s$, $\partial_u$, $1$ in the case of the second-order filters. We introduce the notation $L_1 = \partial_s$, $L_2 = s$, $L_3 = \partial_u$, $L_4 = u$, $L_5 = 1$. In a general setting where the variables $s$, $u$ are replaced by $y_1, y_2, \ldots, y_n$, we would have $2n + 1$ operators $L_1 = \partial_{y_1}$, $L_2 = y_1$, ..., $L_{2n} = y_n$, $L_{2n+1} = 1$. A ladder operator would be $X = \sum_j x_j L_j$ satisfying $[\mathcal{D}, X] = \mu X$.

This equation for $X$ and $\mu$ is the same as (2.4) from Section 2. There are several nice properties of this equation. First, it is easy to see that the matrix $A$, with components $a_{ij} = [L_i, L_j]$, is antisymmetric. In particular, $A$ is a $(2n + 1) \times (2n + 1)$ matrix whose last row and last column are all zeros, and whose first $2n \times 2n$ block has a block-diagonal form

$$A = \begin{pmatrix}
J & 0 & \cdots & 0 & 0\\
0 & J & \cdots & 0 & 0\\
\vdots & & \ddots & & \vdots\\
0 & 0 & \cdots & J & 0\\
0 & 0 & \cdots & 0 & 0
\end{pmatrix}, \qquad
J = \begin{pmatrix} 0 & 1\\ -1 & 0 \end{pmatrix}. \tag{2.12}$$

It is also not too difficult to see that $\mathcal{D}$ must be a linear combination of the products $L_i L_j$. That is, $\mathcal{D} = \sum_{i,j} c_{ij} L_i L_j$. This representation of $\mathcal{D}$ is not unique, and in fact we can arrange it so that the matrix $C$, with components $c_{ij}$, is symmetric.

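Lemma 2.6. The equation $[\mathcal{D}, X] = \mu X$ can be written as a matrix eigenvalue problem $Tx = \mu x$, where $T$ is a $(2n + 1) \times (2n + 1)$ matrix, which can be written as $T = SA$ where $S$ is symmetric and $A$ is antisymmetric.

Proof. We compute an expression for $[\mathcal{D}, X]$ in terms of $C$ and $A$:

$$\begin{aligned}
[\mathcal{D}, X] &= \sum_{i,j,m} c_{ij}x_m [L_i L_j, L_m] = \sum_{i,j,m} c_{ij}x_m \big(a_{jm}L_i + a_{im}L_j\big)\\
&= \sum_{i,m} \big((C + C^T)A\big)_{i,m}\, x_m L_i.
\end{aligned}$$

For the equation $[\mathcal{D}, X] = \mu X$, this implies that we have

$$\sum_m \big((C + C^T)A\big)_{i,m}\, x_m = \mu x_i.$$

In matrix notation, this is just $(C + C^T)Ax = \mu x$. Note that this is the same as equation (2.4) in the case of a second-order filter, but also applies to higher degree filters. We define $S = C + C^T$, which is obviously symmetric, even if $C$ is not. Hence, $T = SA$. □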
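For the second-order filter the factorization $T = SA$ can be written out explicitly. With the ordering $L_1 = \partial_s$, $L_2 = s$, $L_3 = \partial_u$, $L_4 = u$, $L_5 = 1$, one (non-unique) choice of $C$ is read off from the right-hand side of (1.9); the particular placement of entries in `C` below is an assumption made for illustration. The sketch forms $S = C + C^T$ and $A$, and checks that $SA$ reproduces the matrix in (2.4).

```python
def matmul(X, Y):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Illustrative parameter values only.
kappa, alpha, beta, nu = 0.5, 0.3, 1.0, 0.2
sigma, rho, a, b, eps, gamma = 1.0, 0.7, 0.4, -0.6, 0.1, 0.25

# D = sum_ij C[i][j] L_i L_j with L = (d_s, s, d_u, u, 1); indices are 0-based.
C = [[0.0] * 5 for _ in range(5)]
C[0][0] = 0.5 * sigma**2   # (sigma^2/2) d_s d_s
C[2][2] = 0.5 * rho**2     # (rho^2/2) d_u d_u
C[0][1] = kappa            # d_s (kappa s .)
C[0][3] = -nu              # d_s (-nu u .)
C[2][3] = alpha            # d_u (alpha u .)
C[2][1] = -beta            # d_u (-beta s .)
C[1][4] = 2.0 * eps * a    # 2 eps a s
C[3][4] = 2.0 * eps * b    # 2 eps b u
C[4][4] = -2.0 * gamma     # -2 gamma

S = [[C[i][j] + C[j][i] for j in range(5)] for i in range(5)]

# A[i][j] = [L_i, L_j]: only [d_s, s] = [d_u, u] = 1 and their negatives are nonzero.
A = [[0.0] * 5 for _ in range(5)]
A[0][1], A[1][0] = 1.0, -1.0
A[2][3], A[3][2] = 1.0, -1.0

T_from_SA = matmul(S, A)

# The matrix of (2.4), typed in directly for comparison.
T_2_4 = [
    [-kappa,         sigma**2, nu,             0.0,    0.0],
    [0.0,            kappa,    0.0,            -beta,  0.0],
    [beta,           0.0,      -alpha,         rho**2, 0.0],
    [0.0,            -nu,      0.0,            alpha,  0.0],
    [-2.0 * a * eps, 0.0,      -2.0 * b * eps, 0.0,    0.0],
]
max_diff = max(abs(T_from_SA[i][j] - T_2_4[i][j])
               for i in range(5) for j in range(5))
```

The constant $-2\gamma$ sits in the last diagonal entry of $C$ and drops out of $T$, because the last row and column of $A$ are zero; this is why $\gamma$ does not appear in (2.4).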

Lemma 2.7. If $X$ is a ladder operator of $\mathcal{D}$, with $[\mathcal{D}, X] = \mu X$, then there is another ladder operator $Y$ such that $[\mathcal{D}, Y] = -\mu Y$.

Proof. Because $T$ is the product of a symmetric matrix and an antisymmetric matrix, its eigenvalues must come in pairs that are negatives of each other. Thus, if $Tx = \mu x$, then the coefficients of $x$ define a ladder operator for $\mathcal{D}$ with increment $\mu$. But since $T = SA$, there is another eigenvector of $T$ such that $Ty = -\mu y$, and the elements of $y$ define a ladder operator for $\mathcal{D}$ with increment $-\mu$. Because $T$ has an odd number of dimensions, it will always have a zero eigenvalue. In this case $x = y$, and the ladder operator increments by 0. In fact, the ladder operator is just the identity. □
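The pairing of eigenvalues can also be seen from the characteristic polynomial: for an $N \times N$ product $T = SA$ one has $\det(T - \mu I) = (-1)^N \det(T + \mu I)$, since $T^T = -AS$ and $AS$, $SA$ share a characteristic polynomial. For odd $N$ the polynomial is therefore odd in $\mu$ and vanishes at $\mu = 0$. The sketch below checks this identity with exact rational arithmetic for a randomly generated symmetric $S$ and antisymmetric $A$; the generator, the seed, and the helper names are arbitrary assumptions.

```python
import random
from fractions import Fraction

def det(M):
    """Exact determinant by fraction-valued Gaussian elimination."""
    M = [row[:] for row in M]
    n, d = len(M), Fraction(1)
    for j in range(n):
        p = next((r for r in range(j, n) if M[r][j] != 0), None)
        if p is None:
            return Fraction(0)
        if p != j:
            M[j], M[p] = M[p], M[j]
            d = -d
        d *= M[j][j]
        for r in range(j + 1, n):
            f = M[r][j] / M[j][j]
            for c in range(j, n):
                M[r][c] -= f * M[j][c]
    return d

def char_poly(T, mu):
    """det(T - mu*I)."""
    n = len(T)
    return det([[T[i][j] - (mu if i == j else 0) for j in range(n)]
                for i in range(n)])

random.seed(0)
N = 5
R = [[Fraction(random.randint(-3, 3)) for _ in range(N)] for _ in range(N)]
S = [[R[i][j] + R[j][i] for j in range(N)] for i in range(N)]   # symmetric
Q = [[Fraction(random.randint(-3, 3)) for _ in range(N)] for _ in range(N)]
A = [[Q[i][j] - Q[j][i] for j in range(N)] for i in range(N)]   # antisymmetric
T = [[sum(S[i][k] * A[k][j] for k in range(N)) for j in range(N)]
     for i in range(N)]

mus = [Fraction(k, 2) for k in range(-4, 5)]
pairing_holds = all(char_poly(T, mu) == (-1) ** N * char_poly(T, -mu)
                    for mu in mus)
has_zero_eigenvalue = char_poly(T, Fraction(0)) == 0
```

Exact fractions are used so that the identity is tested as an equality and not up to a rounding tolerance.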

We  can  also  show  that,  just  as  in  the  proof  of  Lemma  2.3,  the  operator  D  can  be  written 
as  a  linear  combination  of  the  ladder  operators. 

Lemma 2.8. There exist computable coefficients $\nu_k$ such that

\[ D = \sum_k \nu_k X_{-k} X_k . \]

Proof. Without loss of generality, we suppose $C$ is symmetric, so that $S = 2C$. Let $x^{\pm k}$ be eigenvectors of $T$ with eigenvalues $\mu_{\pm k}$, where $\mu_k = -\mu_{-k}$. We can always do this by Lemma 2.7. Let $y^{\pm k}$ be the eigenvectors of $T^T$ with eigenvalues $\mu_{\pm k}$. So we have $SAx^k = \mu_k x^k$. The transposed problem is $-ASy^k = \mu_k y^k$, or $ASy^k = -\mu_k y^k$. If we left-multiply by $S$, we get $T(Sy^k) = SA(Sy^k) = -\mu_k (Sy^k) = \mu_{-k}(Sy^k)$. Hence, we see that $Sy^k$ is an eigenvector of $T$ with eigenvalue $\mu_{-k}$, so it is a multiple of $x^{-k}$. Thus, we have a relationship between the eigenvectors of $T$ and those of $T^T$. We define $\nu_k$ as the real number such that $Sy^k = 2\nu_k x^{-k}$, where $y^k$ is normalized so that $y^k \cdot x^k = 1$. Therefore, $Cy^k = \nu_k x^{-k}$. With this normalization, we have $\delta_{i,j} = \sum_k x^k_i y^k_j$, so that

\[ c_{i,j} = \sum_p c_{i,p}\,\delta_{p,j} = \sum_{k,p} x^k_j\, c_{i,p}\, y^k_p = \sum_k \nu_k\, x^{-k}_i x^k_j . \]

But we have $D = \sum_{i,j} c_{i,j} L_i L_j$, so now

\[ D = \sum_{i,j} c_{i,j} L_i L_j = \sum_k \nu_k \sum_i x^{-k}_i L_i \sum_j x^k_j L_j = \sum_k \nu_k X_{-k} X_k , \]

where $X_k = \sum_i x^k_i L_i$. □

Writing $X_k^{\pm}$ for $X_{\pm k}$, we can also show that $[X_i^{+}, X_j^{-}] = 0$ unless $i = j$, in which case the commutator is just a multiple of the identity. With this, we can rewrite $D$ as a sum of the products $X_k^{-} X_k^{+}$ for $k > 0$, plus the extra constants we pick up by switching $X^{+}$ and $X^{-}$. Since there is an eigenfunction $f$ for which each $X_k^{+} f = 0$, we then have that $Df = \sum_{k>0} \nu_k X_k^{-} X_k^{+} f + \lambda f = \lambda f$. Thus, we have an effective way to compute the largest eigenvalue of $D$.

3.  Numerics.  To  test  the  stability  of  (1.7)  numerically,  one  could  integrate  x(t)  for  some 
long  time  interval  [0,  T]  using  a  stochastic  integrator,  and  see  if  the  solution  grows.  However, 
this is clearly inefficient and slow if one takes large $T$ with small time-steps. We have done this only to provide a reference against which to test a different technique, described below. First, we give the details of the numerical integration we used.

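We will use the more standard notation here for stochastic differential equations and for Brownian motions, $W_t$. For an SDE of the form

\[ dY_t = a(Y_t)\,dt + b(Y_t)\,dW_t \tag{3.1} \]

one can construct a scheme of weak order two simply by writing down the Ito-Taylor expansion for $Y_t$ and taking the first several terms. The resulting scheme is

\[
\begin{aligned}
Y_{n+1} = {} & Y_n + a_n h + b_n \Delta W_h + \tfrac{1}{2} b_n b_n' \left(\Delta W_h^2 - h\right) \\
& + a_n' b_n V_n + \tfrac{1}{2}\left[a_n a_n' + \tfrac{1}{2} a_n'' b_n^2\right] h^2 \\
& + \left[a_n b_n' + \tfrac{1}{2} b_n'' b_n^2\right] \left(\Delta W_h\, h - V_n\right)
\end{aligned}
\tag{3.2}
\]

where $h$ is the time-step, $a_n = a(Y_n)$, etc., and $V_n$ is a process generated such that the covariance of $\Delta W_h$ and $V_n$ is

\[
\left\langle \begin{pmatrix} \Delta W_h \\ V_n \end{pmatrix} \begin{pmatrix} \Delta W_h & V_n \end{pmatrix} \right\rangle
= \begin{pmatrix} h & \tfrac{1}{2} h^2 \\ \tfrac{1}{2} h^2 & \tfrac{1}{3} h^3 \end{pmatrix} .
\]

This scheme is discussed in detail in [4], [2].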
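As an illustration of how one step of (3.2) might be coded, here is a minimal sketch for a scalar SDE, assuming NumPy. The function name `weak2_step` and its argument list are hypothetical; the drift, diffusion, and their first two derivatives are passed in as callables. The pair $(\Delta W_h, V_n)$ is built from two independent standard normals so that it has the covariance given above.

```python
import numpy as np

def weak2_step(y, h, a, da, d2a, b, db, d2b, rng):
    """One step of the weak order-2 Ito-Taylor scheme (3.2) for a scalar SDE.

    a, b are drift and diffusion; da, d2a, db, d2b are their first and
    second derivatives (all callables of y). rng is a numpy Generator.
    """
    # Correlated increments: Var(dW) = h, Cov(dW, V) = h^2/2, Var(V) = h^3/3.
    u1, u2 = rng.standard_normal(2)
    dW = np.sqrt(h) * u1
    V = 0.5 * h**1.5 * (u1 + u2 / np.sqrt(3.0))

    an, bn = a(y), b(y)
    return (y + an * h + bn * dW
            + 0.5 * bn * db(y) * (dW**2 - h)
            + da(y) * bn * V
            + 0.5 * (an * da(y) + 0.5 * d2a(y) * bn**2) * h**2
            + (an * db(y) + 0.5 * d2b(y) * bn**2) * (dW * h - V))
```

With $b \equiv 0$ the step reduces to the second-order Taylor step $Y_n + a_n h + \tfrac{1}{2} a_n a_n' h^2$ for the deterministic equation, which gives a simple consistency check.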
From now on, we will assume we are in the case of $v = p = 0$ and $a = \alpha$, $b = -\beta$. We use (3.2) to integrate $s(t)$ from (1.7) on some interval $[0, T]$. Then the equation for $u$ is just $\dot{u} = -\alpha u + \beta s(t)$, where $s$ is known on $[0, T]$. The solution to this equation is just

\[ u(t) = e^{-\alpha t} u_0 + \beta e^{-\alpha t} \int_0^t e^{\alpha \tau} s(\tau)\, d\tau . \]

Using a trapezoid method to calculate the integral, we get a second-order accurate representation of $u$ on $[0, T]$. Now with $s$ and $u$ both known we can integrate $\dot{x} = -\gamma x + \varepsilon(\alpha s(t) - \beta u(t))x$ to second order with a Crank-Nicolson method

\[ x_{n+1} = x_n\, \frac{1 + \frac{h}{2}\left(-\gamma + \varepsilon(\alpha s_n - \beta u_n)\right)}{1 - \frac{h}{2}\left(-\gamma + \varepsilon(\alpha s_{n+1} - \beta u_{n+1})\right)} . \]
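The two deterministic steps above can be sketched as follows, assuming NumPy and a sampled path `s` on a uniform grid with step `h`. The function names are hypothetical, and the one-step form of the trapezoid rule used for `u` is one possible way to evaluate the integral, not necessarily the one used by the authors.

```python
import numpy as np

def integrate_u(s, h, alpha, beta, u0=0.0):
    """Solve u' = -alpha*u + beta*s(t) by applying the trapezoid rule to the
    variation-of-constants integral over each step."""
    s = np.asarray(s, dtype=float)
    u = np.empty_like(s)
    u[0] = u0
    decay = np.exp(-alpha * h)
    for n in range(len(s) - 1):
        u[n + 1] = decay * u[n] + 0.5 * h * beta * (decay * s[n] + s[n + 1])
    return u

def integrate_x(s, u, h, alpha, beta, gamma, eps, x0=1.0):
    """Crank-Nicolson for x' = c(t) x with c = -gamma + eps*(alpha*s - beta*u)."""
    c = -gamma + eps * (alpha * np.asarray(s, float) - beta * np.asarray(u, float))
    x = np.empty_like(c)
    x[0] = x0
    for n in range(len(c) - 1):
        x[n + 1] = x[n] * (1.0 + 0.5 * h * c[n]) / (1.0 - 0.5 * h * c[n + 1])
    return x
```

For a constant input $s \equiv 1$ with $\alpha = \beta = 1$ and $u_0 = 0$ the exact solution is $u(t) = 1 - e^{-t}$, which makes a convenient accuracy check.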

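The alternative method is derived as follows. Write the PDE $\partial_t M = DM$ and take moments in $u$. That is, multiply by $u^n$ and integrate with respect to $u$ so that the only variable left is $s$. This will give an infinite system of ODEs for $M_n(s, t) = \int_{\mathbb{R}} u^n M(s, u, t)\, du = \int u^n x^2 P(x, u, s, t)\, dx\, du$. This gives the system

\[
\begin{aligned}
\partial_t M_n = {} & \frac{\sigma^2}{2}\, \partial_s^2 M_n + \kappa\, \partial_s (s M_n) - n\alpha M_n + n\beta s M_{n-1} \\
& + 2\varepsilon\alpha s M_n - 2\varepsilon\beta M_{n+1} - 2\gamma M_n
\end{aligned}
\tag{3.3}
\]

for $n \ge 0$. We seek $M_n(s, t) = e^{\lambda t} w_n(s)$, and expand the $w_n$ in terms of products of $e^{-\kappa s^2}$ and Hermite polynomials. We will use the notation $\varphi_j(s) = e^{-\kappa s^2} H_j(\sqrt{\kappa}\, s)$, where $H_j$ is the $j$-th Hermite polynomial. Then $w_n(s) = \sum_{j=0}^{\infty} a^n_j \varphi_j(s)$, and one can easily check that, for $\sigma = 1$, $\tfrac{1}{2} \partial_s^2 \varphi_j + \kappa\, \partial_s(s \varphi_j) = -j\kappa\, \varphi_j$. Also, the Hermite polynomials satisfy $H_{j+1}(x) - 2x H_j(x) + 2j H_{j-1}(x) = 0$, so that we can write $s \varphi_j(s) = \frac{1}{2\sqrt{\kappa}} \left(\varphi_{j+1} + 2j \varphi_{j-1}\right)$.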

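Thus (3.3) can be written as

\[
\begin{aligned}
\lambda \sum_j a^n_j \varphi_j = {} & \sum_j (-j\kappa - n\alpha)\, a^n_j \varphi_j + \frac{n\beta}{2\sqrt{\kappa}} \sum_j a^{n-1}_j \left(\varphi_{j+1} + 2j \varphi_{j-1}\right) \\
& + \frac{\varepsilon\alpha}{\sqrt{\kappa}} \sum_j a^n_j \left(\varphi_{j+1} + 2j \varphi_{j-1}\right) - 2\varepsilon\beta \sum_j a^{n+1}_j \varphi_j - 2\gamma \sum_j a^n_j \varphi_j .
\end{aligned}
\tag{3.4}
\]

The equation for the coefficients of $\varphi_j$ becomes

\[
\begin{aligned}
\lambda a^n_j = {} & -(j\kappa + n\alpha + 2\gamma)\, a^n_j + \frac{n\beta}{2\sqrt{\kappa}} \left(a^{n-1}_{j-1} + 2(j+1)\, a^{n-1}_{j+1}\right) \\
& + \frac{\varepsilon\alpha}{\sqrt{\kappa}} \left(a^n_{j-1} + 2(j+1)\, a^n_{j+1}\right) - 2\varepsilon\beta\, a^{n+1}_j .
\end{aligned}
\tag{3.5, 3.6}
\]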

If we truncate the system to $0 \le n < N_1$ and $0 \le j < N_2$, then we can approximate (3.5) by an $N_1 N_2 \times N_1 N_2$ matrix eigenvalue problem, $\lambda w = Qw$. The matrix $Q$ will be block tri-diagonal, and $w$ is a column vector with components $w = (a^1, a^2, \ldots, a^{N_1})^T$ where $a^n = (a^n_1, a^n_2, \ldots, a^n_{N_2})$. This can be solved very quickly and converges for even small values of $N_1$ and $N_2$.
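To make the truncation concrete, the following is a minimal sketch of how $Q$ could be assembled from the coefficient recurrence (3.5)-(3.6), assuming NumPy, $\sigma = 1$, zero-based indices $0 \le n < N_1$, $0 \le j < N_2$, and the ordering $(n, j) \mapsto nN_2 + j$. The names `build_Q` and `growth_rate` are hypothetical, and the $-2\varepsilon\beta$ sign of the coupling to $a^{n+1}_j$ is taken from (3.3).

```python
import numpy as np

def build_Q(N1, N2, alpha, beta, gamma, kappa, eps):
    """Truncated matrix of the recurrence for a^n_j, 0 <= n < N1, 0 <= j < N2."""
    Q = np.zeros((N1 * N2, N1 * N2))
    idx = lambda n, j: n * N2 + j          # flatten (n, j) to a row/column index
    rk = np.sqrt(kappa)
    for n in range(N1):
        for j in range(N2):
            row = idx(n, j)
            Q[row, row] = -(j * kappa + n * alpha + 2.0 * gamma)
            if j >= 1:
                Q[row, idx(n, j - 1)] += eps * alpha / rk
                if n >= 1:
                    Q[row, idx(n - 1, j - 1)] += n * beta / (2.0 * rk)
            if j + 1 < N2:
                Q[row, idx(n, j + 1)] += 2.0 * (j + 1) * eps * alpha / rk
                if n >= 1:
                    Q[row, idx(n - 1, j + 1)] += 2.0 * (j + 1) * n * beta / (2.0 * rk)
            if n + 1 < N1:
                Q[row, idx(n + 1, j)] += -2.0 * eps * beta
    return Q

def growth_rate(Q):
    """Largest real part among the eigenvalues of Q."""
    return np.max(np.linalg.eigvals(Q).real)
```

With this ordering each block row $n$ couples only to blocks $n-1$, $n$, and $n+1$, which is the block tri-diagonal structure noted above. For $\varepsilon = 0$ the matrix is block lower triangular and the largest eigenvalue is $-2\gamma$, which is a quick sanity check on the assembly.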

We can also solve the initial value problem $\dot{w} = Qw$. In particular, if we write $w(t) = \sum_m c_m \Phi_m e^{\lambda_m t}$, where $\Phi_m$ are the eigenvectors of $Q$, then we can find the long-time behavior of $M_{xx}(t)$ by looking at the first component of $w$. To see this, recall that $M_n(s, t) = \int x^2 u^n P(x, u, s, t)\, dx\, du = \sum_j a^n_j \varphi_j(s)$ and that we are interested in the second moment in $x$, which is $M_{xx}(t) = \int x^2 P(x, u, s, t)\, dx\, du\, ds$. So if we have an expression for $M_0(s, t)$, we can integrate it: $M_{xx}(t) = \int_{\mathbb{R}} M_0(s, t)\, ds$. Because the Hermite polynomials $H_j(\sqrt{\kappa}\, s)$ are orthogonal with respect to the weight $e^{-\kappa s^2}$ and $H_0 = 1$ is constant, the integral $\int \varphi_j(s)\, ds$ equals $\sqrt{\pi/\kappa}$ for $j = 0$ and vanishes for $j > 0$. Therefore, once we have solved $\dot{w} = Qw$, the function $\sqrt{\pi/\kappa}\, w_0(t)$ will be a good approximation to $M_{xx}$. In practice, we can even truncate the system further once we know the eigenvalues. Since eigenvalues $\lambda_m$ with negative real part will play little role for large times (and it is the large time limit we are interested in) we can simply approximate $w_0(t)$ by the terms in the series corresponding to the eigenvalues with the largest real part.
Two examples are shown in Figures 3.1(a) and 3.1(b) for different values of the parameter $\varepsilon$. The solid lines are the results of the stochastic integrator plotted against $t \in [0, T]$ with $T = 10$, and the dotted lines are the results of the truncated initial value problem against $t$. The critical value of $\varepsilon$ for those parameter values is $\varepsilon = 0.0165\ldots$, so in fact the solution is unstable in both cases. This is easily seen in Figure 3.1(b), but is not apparent in Figure 3.1(a). If the analytic expression for $\lambda_{nj}$ from Theorem 2.5 were unknown to us, then we might not know that both cases are unstable. To see the instability using the stochastic integrator, one would have to integrate for a much larger value of $T$. However, when solving the initial value problem, we find the largest eigenvalues of $Q$ and hence of $D$, which in the case $\varepsilon = 0.02$ is $\lambda \approx 0.0094$.

Fig. 3.1. The horizontal axis is time, $t \in [0, 10]$. The parameter values on both plots are $\alpha = 0.3$, $\kappa = 0.5$, $\beta = 1$, $\gamma = 0.01$, $\sigma = 1$. Panel (b): $\varepsilon = 0.07$.
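The modal expansion $w(t) = \sum_m c_m \Phi_m e^{\lambda_m t}$ used for the initial value problem, together with the optional truncation to the dominant modes, can be sketched as follows. This assumes NumPy and a diagonalizable $Q$; the function name `modal_solution` and the `n_modes` argument are hypothetical.

```python
import numpy as np

def modal_solution(Q, w0, t, n_modes=None):
    """Evaluate w(t) for w' = Q w via the eigen-expansion of Q.

    If n_modes is given, only the modes whose eigenvalues have the largest
    real parts are kept (the long-time approximation).
    """
    lam, Phi = np.linalg.eig(Q)
    c = np.linalg.solve(Phi, np.asarray(w0, dtype=complex))  # expansion coefficients
    order = np.argsort(-lam.real)                            # dominant modes first
    if n_modes is not None:
        order = order[:n_modes]
    w = Phi[:, order] @ (c[order] * np.exp(lam[order] * t))
    # For real Q and w0 the full sum is real; when truncating, complex
    # conjugate pairs should be kept together for this to remain true.
    return w.real
```

Sorting by real part makes explicit which terms control the large-time limit: the sign of the largest real part decides between growth and decay.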

4. Conclusions and Future Work. We have developed a method for determining the stability of (1.7) that generalizes to higher-order filters. The stability boundary, defined as the solution to $\lambda_0 = 0$, where $\lambda_0$ is the largest eigenvalue of $D$, has an analytic expression. In the case of first- and second-order filters, this is achieved by solving (2.8) and (2.10) for $\varepsilon$. The numerical methods that we have described generalize naturally to higher-order differential equations. In particular, they are applicable to the Mathieu equation (1.1). We have used them to determine the stability of (1.1), and found they match the results from perturbation theory. We plan to investigate whether the ladder operators exist for higher-order differential equations, such as (1.1). If there are ladder operators in this case, it is not clear if they completely determine the stability as in the case of the first-order equation.


T.J.  Blass  and  L.A.  Romero 




REFERENCES 

[1] L. Arnold, Stochastic differential equations as dynamical systems, in Realization and modelling in system theory (Amsterdam, 1989), vol. 3 of Progr. Systems Control Theory, Birkhäuser Boston, Boston, MA, 1990, pp. 489-495.

[2] S. Asmussen and P. W. Glynn, Stochastic simulation: algorithms and analysis, vol. 57 of Stochastic Modelling and Applied Probability, Springer, New York, 2007.

[3] P. A. M. Dirac, The Principles of Quantum Mechanics, Oxford, at the Clarendon Press, 1947. 3d ed.

[4] P. E. Kloeden and E. Platen, Numerical solution of stochastic differential equations, vol. 23 of Applications of Mathematics (New York), Springer-Verlag, Berlin, 1992.

[5] H. Lamb, Hydrodynamics, Cambridge Mathematical Library, Cambridge University Press, Cambridge, sixth ed., 1993. With a foreword by R. A. Caflisch [Russel E. Caflisch].

[6] N. G. van Kampen, Stochastic processes in physics and chemistry, vol. 888 of Lecture Notes in Mathematics, North-Holland Publishing Co., Amsterdam, 1981.


CSRI  Summer  Proceedings  2010 




COMPARISON  OF  SENSITIVITY  ANALYSIS  METHODS  FOR  NUCLEAR 
REACTOR  NEUTRONICS 

W. CYRUS PROCTOR*, BRIAN M. ADAMS†, CRISTIAN RABITI‡, AND HANY S. ABDEL-KHALIK§

Abstract.  This  preliminary  study  compares  adjoint-based  local  sensitivity  analysis  to  global  sensitivity  methods 
on  a  simplified  nuclear  reactor  neutronics  model.  Sensitivity  analysis  methods  manipulate  computational  models 
to  investigate  how  variations  in  input  parameters  affect  system  responses  of  interest.  For  linear  models  involving 
large  numbers  of  parameters  and  few  responses,  adjoint  methods  efficiently  yield  accurate  local  sensitivities.  As 
nonlinearities  are  introduced,  the  associated  computational  burden  grows,  at  best,  in  proportion  to  the  number  of 
input  parameters  to  estimate  all  cross-derivative  terms.  For  strongly  nonlinear  models,  global  sensitivity  analysis 
is  more  appropriate  for  estimating  parameter  importance,  as  it  can  account  for  nonlinear  interactions  and  response 
variations  over  the  whole  admissible  parameter  range,  not  solely  at  nominal  values.  However,  as  the  number  of 
parameters  increases,  global  methods  become  computationally  infeasible.  This  work  exercises  local  and  global 
methods  on  a  neutron  diffusion  simulation  of  a  sodium-cooled  fast  reactor,  considered  in  a  linear  regime  with  a 
modest  number  of  input  parameters.  It  compares  the  relative  rankings  and  strength  of  influence  resulting  from  both 
approaches.  Future  work  will  develop  hybrid  methods  to  simultaneously  address  the  curse  of  dimensionality  and 
nonlinearities,  and  apply  them  to  a  simulation  of  nonlinear  reactor  phenomena  with  many  input  parameters. 


1. Introduction. Advances in computational systems and simulations enable ever-more efficient and complex calculations for nuclear reactor phenomena such as neutronics, thermal-hydraulics, structures, or plant dynamics. Of particular interest to the nuclear engineering community is reactor core simulation in novel operating regimes [7]. Such simulations require specification of many governing parameters describing physics, chemistry, or the environment/operating conditions. It is crucial to understand how system responses are influenced by input data, because experiments to improve these data and reduce uncertainty can be exceptionally expensive and must be carefully selected and designed [2]. Sensitivity analysis of model outputs with respect to input parameters critically informs the selection process for experimental data collection and can make uncertainty quantification more computationally feasible by propagating only the most crucial factors.

Nuclear reactor core neutronics simulations (the target of the present work) rely primarily on microscopic cross-section data to estimate the rates of interaction between neutrons and various reactor materials. Cross-section data, which are experimentally evaluated, are complicated functions of neutron energy, reaction type, and reactor materials. To characterize
them  in  raw  form  would  lead  to  impractical  computational  cost.  Therefore  reactor  physicists 
appeal  to  homogenization  techniques  to  reduce  the  complexity  of  cross-section  data  [12].  Our 
work  employs  a  simplified  core  model  with  cross-section  data  describing  interaction  rates  for 
45  isotopes,  each  with  5  reaction  types,  each  with  33  energy  groups  (a  group  is  an  energy 
range  over  which  cross-sections  are  assumed  constant).  Uncertainties  in  these  cross-section 
data  are  a  major  source  of  uncertainty  for  reactor  physics  simulation  [6]  and  therefore  are  the 
focus  of  this  study. 

Historically  for  neutronics,  sensitivity  analysis  is  a  common  first  step  in  the  identification 
of  input  data  uncertainty  contributions  to  response  uncertainty.  In  particular,  adjoint-based 
methods  for  uncertainty  propagation  are  commonly  used  for  steady  state  reactor  physics 
models  because  in  steady  state,  key  performance  metrics  of  interest  to  design  and  operation 
(such  as  k-eigenvalue,  peak  power,  or  reactivity  coefficients)  behave  almost  linearly  within 
the range of interest for cross-section variations. Since analyses typically involve only a handful of integral response metrics and orders of magnitude more cross-section input variables, adjoint approaches compute the necessary derivatives efficiently, using only marginally more computation than a nominal forward calculation.

In this special linear case, measured experimental variance-covariance information of model input parameters is readily combined with sensitivity information to propagate moments to obtain confidence bounds on responses of interest. In this method, only the first and second moments of input probability density functions are propagated through the model to estimate the first and second moments of the responses [9]. The accuracy of this method depends on the validity of the linearity assumption used in extrapolating the function value and gradient data to the entire range of input parameter variations considered. For cross-sections, the range of variation is set by the accuracy of the experimental procedure employed to measure cross-section data. For steady-state neutronics calculations, it has been reported that models' deviation from linear behavior could be ignored, thereby allowing the use of adjoint-based methods for sensitivity analysis and uncertainty quantification [6]. As neutronics models are tightly coupled with other physics, however, such as thermal hydraulics and fuel performance codes, the linearity assumption breaks down, and other methods for sensitivity analysis must be explored.

This manuscript numerically compares various sample-based sensitivity techniques available in Sandia National Laboratories' DAKOTA software to the derivative-based adjoint approach common in nuclear engineering applications. It will provide a brief computational model overview followed by general overviews of the sensitivity techniques considered. Results in the form of relative sensitivity coefficients are presented for comparison. Additional insights from simple and partial rank correlations as well as Sobol indices are discussed.

2. Reactor Neutronics Simulation. The sodium-cooled fast reactor model of interest in this work is simulated with a FORTRAN 90 code developed specifically for this comparison. It utilizes point-wise Gauss-Seidel with successive over-relaxation to solve a multi-group neutron diffusion equation in both forward and adjoint modes. Consider the forward form of the equations [3] for energy groups $g = 1, \ldots, G$:

\[
-\nabla \cdot D_g \nabla \phi_g + \Sigma_{R,g}\, \phi_g = \frac{\chi_g}{k} \sum_{g'=1}^{G} \nu_{g'} \Sigma_{f,g'}\, \phi_{g'} + \sum_{g'=1}^{g-1} \Sigma_{s,g' \to g}\, \phi_{g'} + \sum_{g'=g+1}^{G} \Sigma_{s,g' \to g}\, \phi_{g'} , \qquad g = 1, \ldots, G,
\tag{2.1}
\]

where the spatial dependence $r$ has been suppressed, and the physical parameters are defined in Fig. 2.1.

The removal cross section contains total and within-group scatter terms, while the right-hand side includes the fission, downscatter and upscatter terms. $k$ is called the multiplication factor, which balances the system of equations such that the number of neutrons produced equals the number of neutrons lost. A multiplication factor equal to one implies that the associated reactor system is in steady state, i.e. capable of self-sustained production of neutrons.


$D_g$                  diffusion coefficient [cm]
$\phi_g$               neutron flux [cm$^{-2}$ sec$^{-1}$]
$\Sigma_{R,g}$         macroscopic removal cross section [cm$^{-1}$]
$\chi_g$               fission neutrons yield [-]
$k$                    multiplication factor (inverse of largest eigenvalue) [-]
$\nu_g$                average number of neutrons created per fission [-]
$\Sigma_{f,g}$         macroscopic fission cross section [cm$^{-1}$]
$\Sigma_{s,g' \to g}$  macroscopic group to group scattering cross section [cm$^{-1}$]


Fig. 2.1. Physical parameters for the neutron diffusion equations.


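Specifically, for cylindrical $(r, z)$ geometry we may write for energy group $g$

\[
-\left( \frac{1}{r} \frac{\partial}{\partial r}\, r D_g(r, z) \frac{\partial}{\partial r} + \frac{\partial}{\partial z}\, D_g(r, z) \frac{\partial}{\partial z} \right) \phi_g(r, z)
+ \left( \Sigma_{t,g}(r, z) - \Sigma_{s,g \to g}(r, z) \right) \phi_g(r, z) = S_g(r, z),
\tag{2.2}
\]

where

\[
S_g(r, z) = \sum_{g'=1}^{g-1} \Sigma_{s,g' \to g}(r, z)\, \phi_{g'}(r, z) + \sum_{g'=g+1}^{G} \Sigma_{s,g' \to g}(r, z)\, \phi_{g'}(r, z)
+ \frac{\chi_g(r, z)}{k} \sum_{g'=1}^{G} \nu_{g'}(r, z)\, \Sigma_{f,g'}(r, z)\, \phi_{g'}(r, z) .
\]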

Equation (2.2) was discretized using a finite-volume scheme, and the resulting system of equations may be written in matrix form as

\[ L\phi = \lambda F \phi \]

and solved via $r$-line ordering starting from $r = 0$ to $r = R$ and $z = 0$ to $z = H$. Thus, the corresponding eigenvalue system may be transformed into the iterative solution of $G$ fixed-source matrix systems via a power iteration technique to obtain the dominant eigenvalue $\lambda$ and corresponding eigenvector $\phi$ [3].
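The power iteration mentioned above can be illustrated on a small dense system. The sketch below assumes NumPy and the convention $L\phi = \frac{1}{k} F\phi$, so that $k$ is the dominant eigenvalue of $L^{-1}F$; the function name `power_iteration` is hypothetical, and a direct solve stands in for the Gauss-Seidel sweeps of the actual code.

```python
import numpy as np

def power_iteration(L, F, tol=1e-12, max_iter=10_000):
    """Dominant k and flux phi of L phi = (1/k) F phi by source iteration."""
    phi = np.ones(L.shape[0])
    k = 1.0
    for _ in range(max_iter):
        psi = np.linalg.solve(L, F @ phi)          # fixed-source solve with fission source
        k_new = np.sum(F @ psi) / np.sum(F @ phi)  # ratio of successive fission sources
        phi = psi / np.linalg.norm(psi)
        if abs(k_new - k) < tol * abs(k_new):
            return k_new, phi
        k = k_new
    return k, phi
```

Each pass solves a fixed-source problem with the previous fission source, and the ratio of successive fission sources converges to the multiplication factor when the dominant eigenvalue is well separated.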

This eigenvalue system may be written in terms of a broader general framework designed to encompass not only eigensystems but also fixed-source problems:

\[ A\phi = Q, \]

where $A = L - \lambda F$ and $Q = 0$ in this case. Utilizing the variational method of [8], we may consider the system as constraints on the output response $R$. These constraints may be incorporated with the response using Lagrange multipliers to create an augmented functional $T$:

\[ T = R - \langle Z (A\phi - Q) \rangle, \]
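where $Z$ is the Lagrange multiplier. A change in some input parameter $\alpha$ results in $T \to T'$. We may expand $T'$ via a first-order Taylor series expansion about the reference state to obtain

\[ T' = T + \left\langle \frac{\partial T}{\partial Z} \Delta Z + \frac{\partial T}{\partial \phi} \Delta\phi + \frac{\partial T}{\partial \alpha} \Delta\alpha \right\rangle . \]

The perturbation in the response may be rewritten as

\[ \Delta R = \left\langle \frac{\partial T}{\partial \alpha} \Delta\alpha \right\rangle + \left\langle \frac{\partial T}{\partial Z} \Delta Z \right\rangle + \left\langle \frac{\partial T}{\partial \phi} \Delta\phi \right\rangle . \]

If we wish $T$ to be stationary to changes in the Lagrange multiplier we must set $A\phi - Q = 0$, which, for the case of our eigensystem, is true. If we wish $T$ to be stationary to changes in the neutron flux we must choose $Z$ to be the adjoint neutron flux ($Z = \phi^*$), which satisfies the equation

\[ A^* \phi^* = \frac{\partial R}{\partial \phi} . \]

Thus, the perturbation of a general response $R$ due to a change in parameter $\alpha$ is given as

\[ \Delta R = \left\langle \frac{\partial T}{\partial \alpha} \Delta\alpha \right\rangle = \left\langle \left( \frac{\partial R}{\partial \alpha} - \phi^* \frac{\partial A}{\partial \alpha} \phi + \phi^* \frac{\partial Q}{\partial \alpha} \right) \Delta\alpha \right\rangle . \tag{2.3} \]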

The adjoint flux physically denotes the importance of neutrons in the phase space to the response of interest. $\partial R / \partial \phi$ is determined based on the response, implying that the adjoint equation must be solved independently for every response of interest. This places a limitation on the use of adjoint methods for problems involving many responses. In this paper we focus only on the multiplication factor as a response, for which $\partial R / \partial \phi = 0$. Carrying out a first-order approach, an estimate of the change in the multiplication factor ($R = k$) due to a change in the cross-section data $\sigma$ is given by:

\[ \frac{dk}{d\sigma} = -k^2\, \frac{\left\langle \phi^*, \left( \frac{\partial L}{\partial \sigma} - \lambda \frac{\partial F}{\partial \sigma} \right) \phi \right\rangle}{\langle \phi^*, F\phi \rangle} . \tag{2.4} \]

This result may be compared directly to a finite-difference approach for derivative approximation. Equation (2.4) may also serve as the basis for a relative sensitivity coefficient between the output response $k$ and the input parameters $\sigma$:

\[ S(k, \sigma) = \frac{dk / k}{d\sigma / \sigma} = -k\sigma\, \frac{\left\langle \phi^*, \left( \frac{\partial L}{\partial \sigma} - \lambda \frac{\partial F}{\partial \sigma} \right) \phi \right\rangle}{\langle \phi^*, F\phi \rangle} . \tag{2.5} \]
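The comparison between the adjoint estimate and a finite-difference derivative can be sketched on a small matrix model. The example below assumes NumPy, dense matrices, $\lambda = 1/k$, and a perturbation that enters as $L(\sigma) = L_0 + \sigma\, dL$, $F(\sigma) = F_0 + \sigma\, dF$; the function names and test matrices are illustrative and are not the reactor model of this paper.

```python
import numpy as np

def k_eff(L, F):
    """Dominant eigenpair (k, phi) of L phi = (1/k) F phi."""
    vals, vecs = np.linalg.eig(np.linalg.solve(L, F))
    i = np.argmax(vals.real)
    return vals[i].real, vecs[:, i].real

def adjoint_dk(L, F, dL, dF):
    """First-order dk/dsigma from forward and adjoint eigenvectors."""
    k, phi = k_eff(L, F)
    _, phi_adj = k_eff(L.T, F.T)      # solves (L^T - F^T / k) phi* = 0
    lam = 1.0 / k
    num = phi_adj @ (dL - lam * dF) @ phi
    den = phi_adj @ F @ phi
    return -k**2 * num / den

def finite_difference_dk(L, F, dL, dF, h=1e-6):
    """Central difference of k along the same perturbation direction."""
    kp, _ = k_eff(L + h * dL, F + h * dF)
    km, _ = k_eff(L - h * dL, F - h * dF)
    return (kp - km) / (2.0 * h)
```

The adjoint route needs one forward and one adjoint eigen-solve regardless of how many parameters are perturbed, whereas the finite-difference route needs two additional forward solves per parameter.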


3. Global Sensitivity Analysis. In contrast to local sensitivity, the goal of global sensitivity analysis is to assess the influence of input parameters, considered over their whole possible range, on output responses. Such an approach is typically used to rank the importance of the input factors, determine the effect of their variance on the variance of the output, or assess whether higher-order interactions between parameters affect output responses [10]. Global sensitivity and uncertainty analysis methods may offer additional problem insight when response linearity as a function of input variables is violated. Sampling-based approaches to sensitivity and uncertainty analysis, such as Latin hypercube sampling (LHS), can be very robust even in the presence of strong nonlinearity, but can be computationally expensive for screening studies, where (10 x number of input variables) evaluations of the model are typically used. An advantage of sampling, however, is that it can be applied without modifying the solver (simulator); this favorable characteristic is also true for other "black-box" approaches to sensitivity and uncertainty analysis, such as design and analysis of computer experiments (DACE), reliability analysis, and stochastic expansion methods such as polynomial chaos and stochastic collocation.

Global sensitivity analysis methods typically identify an ensemble of well distributed points in the input variable space, evaluate the computational model at these points, and perform statistical analysis of the resulting function values (and possibly derivatives, if available). Sandia's DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) includes numerous algorithms for sensitivity analysis and uncertainty quantification, including those briefly described here (for further details, consult the DAKOTA User's Manual [1]). DAKOTA provides a flexible, extensible interface to any analysis code, permitting its ready use for global sensitivity analysis of a fast reactor simulated with a neutronics model.

3.1. Latin hypercube sampling and correlation. Latin hypercube sampling (LHS) is among the most robust, ubiquitous, and accepted global sampling and analysis techniques, which include other sampling methods such as standard Monte Carlo, quasi-Monte Carlo, orthogonal arrays, and jittered sampling. It relies on a probabilistic characterization of input uncertainties (cross section deviations in the present context), from which realizations of the input variables are generated for model evaluation, and then statistical analysis on the corresponding response values can be performed. LHS typically resolves statistics with fewer samples than standard Monte Carlo and can generate sample designs respecting input variable correlation structure [15]. DAKOTA reports the mean, standard deviation, and coefficient of variation of each response (together with confidence intervals based on the number of samples used) and correlation coefficients (both on the data and on their ranks). For example, a simple (Pearson) correlation between output $y$ and input $x$ is given by

\[ \rho_{x,y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2\, \sum_i (y_i - \bar{y})^2}} \tag{3.1} \]

whereas partial correlation coefficients adjust for the effects of other variables. The results presented here focus on the simple and partial correlation coefficients, which are scaled between -1 and 1. Larger absolute magnitudes indicate a stronger linear relationship between the input and output (see [1]). Additional statistical techniques (such as regression analysis) can also be used to analyze the parameter/response pairs resulting from an LHS study [13].
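For illustration only, a basic Latin hypercube design on the unit cube and the simple correlation (3.1) can be sketched as follows, assuming NumPy. The function names `lhs` and `pearson` are hypothetical; this is not DAKOTA's implementation and it ignores input correlation structure.

```python
import numpy as np

def lhs(n, d, rng):
    """n Latin hypercube samples on [0, 1)^d: each of the n equal-width strata
    is hit exactly once in every dimension."""
    strata = np.array([rng.permutation(n) for _ in range(d)]).T   # shape (n, d)
    return (strata + rng.random((n, d))) / n

def pearson(x, y):
    """Simple (Pearson) correlation coefficient, as in (3.1)."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = lhs(180, 3, rng)
    y = 3.0 * X[:, 0] - X[:, 1]            # toy linear response
    print([pearson(X[:, i], y) for i in range(3)])
```

Rank (Spearman) correlations follow by applying the same formula to the ranks of `x` and `y`.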

3.2. Variance-based decomposition. Variance-based decomposition (VBD) summarizes how model output variability can be attributed to variability in individual input variables. This relationship is captured in a main effect sensitivity index

\[ S_i = \frac{\mathrm{Var}_{x_i}\left[ E(Y \mid x_i) \right]}{\mathrm{Var}(Y)} , \tag{3.2} \]

which reflects the fraction of output uncertainty attributable to $x_i$ alone, and the total effect index

\[ T_i = \frac{E\left[ \mathrm{Var}(Y \mid x_{-i}) \right]}{\mathrm{Var}(Y)} = \frac{\mathrm{Var}(Y) - \mathrm{Var}\left( E[Y \mid x_{-i}] \right)}{\mathrm{Var}(Y)} , \tag{3.3} \]

which accounts for variability due to $x_i$ and its interactions with other input variables. Here $x_{-i}$ indicates that variable $i$ is omitted from the vector of input variables. Larger values of these Sobol indices indicate a stronger influence of an input on variance of the output. The sum of main effect indices is less than or equal to 1 (and equal to 1 for a linear model), whereas the total effect indices need not be. See [10] and [11] for further information.
For $d$ input parameters, VBD requires the evaluation of $d$-dimensional integrals and, when implemented with replicates in sampling, typically requires $d + 2$ replicates of $N$ LHS samples. As this can be prohibitively expensive, even for tens of variables, the sensitivity indices are often calculated based on a surrogate model or polynomial chaos approximation.
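One common way to estimate the indices (3.2) and (3.3) with $d + 2$ sets of $N$ model evaluations is a "pick-freeze" sampling estimator. The sketch below assumes NumPy, independent uniform inputs on $[0, 1)$, and plain Monte Carlo sampling rather than LHS replicates; the function name `sobol_indices` is hypothetical and the estimator formulas are standard choices that may differ from those implemented in DAKOTA.

```python
import numpy as np

def sobol_indices(f, d, N, rng):
    """Estimate main (S) and total (T) effect indices for Y = f(X),
    X uniform on [0, 1)^d. f maps an (N, d) array to N outputs.
    Uses N * (d + 2) model evaluations."""
    A = rng.random((N, d))
    B = rng.random((N, d))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    S = np.empty(d)
    T = np.empty(d)
    for i in range(d):
        AB = A.copy()
        AB[:, i] = B[:, i]                             # A with column i taken from B
        fAB = f(AB)
        S[i] = np.mean(fB * (fAB - fA)) / var          # Var(E[Y | x_i]) / Var(Y)
        T[i] = 0.5 * np.mean((fA - fAB) ** 2) / var    # E[Var(Y | x_-i)] / Var(Y)
    return S, T

if __name__ == "__main__":
    f = lambda X: 2.0 * X[:, 0] + X[:, 1]              # toy linear model
    print(sobol_indices(f, 2, 200_000, np.random.default_rng(0)))
```

For the toy linear model $Y = 2x_1 + x_2$ the exact values are $S = T = (0.8, 0.2)$, since the input variances contribute $4/12$ and $1/12$ to $\mathrm{Var}(Y) = 5/12$.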

3.3. Surrogate model acceleration. Another use of LHS sample designs is to construct global surrogate models, also referred to as response surfaces or meta-models. For instance, a modest number of evaluations (typically on the order of 2-10 times the number of input variables) of the computational model can be used to train a Kriging (Gaussian process), MARS, or artificial neural network model [5]. This surrogate model is comparatively inexpensive to evaluate and can be sampled tens or hundreds of thousands of times to calculate correlation coefficients or Sobol sensitivity indices.

Polynomial chaos expansions (PCE) globally approximate the output $y$ as a function of input random variables $x$:

\[ y \approx \sum_{j=0}^{P} \alpha_j \psi_j(x) , \tag{3.4} \]

where orthogonal polynomials $\psi_j$ are selected to yield optimal convergence of the approximation [4]. Specifically, they are chosen to be orthogonal with respect to the probability distribution of the inputs $x$ with the same support, e.g., Hermite polynomials are used with normal random variables whereas a Legendre basis is optimal for uniform. The coefficients $\alpha_j$ can be calculated with spectral projection and multi-dimensional integration or regression. Here, sparse-grid quadrature and cubature integration techniques for PCE are considered.

Once constructed, a PCE can again be exhaustively sampled, but often statistics of interest can be calculated analytically using the structure of the approximation. Sudret [14] demonstrated that Sobol sensitivity indices can be calculated directly from a PCE, and that approach as implemented in DAKOTA [16] is used here.
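A one-dimensional example may help show how statistics follow analytically from PCE coefficients. The sketch below assumes NumPy, a single standard normal input, the probabilists' Hermite basis from `numpy.polynomial.hermite_e`, and coefficients obtained by spectral projection with Gauss quadrature. The function names are hypothetical, and this is not DAKOTA's sparse-grid or cubature implementation.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He

def pce_coefficients(f, P, n_quad=None):
    """Project f(X), X ~ N(0, 1), onto He_0, ..., He_P by Gauss quadrature."""
    n_quad = n_quad or P + 2
    x, w = He.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)              # normalize to the N(0, 1) density
    fx = f(x)
    coeffs = np.empty(P + 1)
    for j in range(P + 1):
        basis = He.hermeval(x, [0.0] * j + [1.0])              # He_j at the nodes
        coeffs[j] = np.sum(w * fx * basis) / factorial(j)      # E[He_j^2] = j!
    return coeffs

def pce_mean_var(coeffs):
    """Mean and variance read directly from the expansion coefficients."""
    mean = coeffs[0]
    var = sum(c**2 * factorial(j) for j, c in enumerate(coeffs) if j > 0)
    return mean, var

if __name__ == "__main__":
    c = pce_coefficients(lambda x: x**2 + 2.0 * x + 1.0, P=3)
    print(c, pce_mean_var(c))
```

In several dimensions the same orthogonality argument lets the variance be split by which inputs each basis term depends on, which is the idea behind computing Sobol indices directly from the coefficients.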

4. Reactor Core Model. The model employed in this study is a cylindrical, axially symmetric, two-region, and spatially homogeneous approximation of the Sodium-cooled Fast Reactor. Approximations in geometry and compositions are necessary to reduce the size of cross-section data used to describe the reactor core (the total number of cross-section data is 7425, which is reduced from 380952). The fuel region is capped axially by two neutron reflectors to reduce neutron leakage and hence reduce the core size necessary for a self-sustained reaction. Figure 4.1 illustrates the geometry and boundary conditions used. See reference [7] for fuel and reflector materials composition. Overall dimensions were set at $R = 100$ cm, $h = 75$ cm and $H = 100$ cm. A radial and axial mesh of 30 by 40 nodes is used throughout the comparison. This spacing allowed for adequate resolution of the neutron flux at the fuel interfaces. The cross section data are based on an ERANOS model for a Sodium-cooled Fast Reactor also discussed in [7].

5.  Comparison  of  global  and  local  sensitivity  approaches. 

5.1.  Analysis  approaches  used.  Several  DAKOTA  SA  methods  as  well  as  the  adjoint- 
based  sensitivity  coefficients  output  from  the  computational  model  were  compared  in  terms 
of  importance  rankings  and  relative  sensitivities.  The  DAKOTA  methods  compared  were 
Latin  hypercube  sampling  (LHS)  with  or  without  variance-based  decomposition  (VBD)  or 
surrogates  (Gaussian  process  or  Kriging  models),  and  polynomial  chaos  expansions  (PCE) 
directly  yielding  Sobol  indices  and  local  sensitivity  estimates.  The  Gaussian  process  model 
used has a quadratic trend function, while the Kriging model has a constant trend. While
exhaustive numerical experiments were conducted varying sampling size and method type,
only a few are considered for brevity. Table 5.1 lists the DAKOTA methods/options presented
here.

Fig. 4.1. Cross section of cylindrical two-region layout of reactor.

Table 5.1
Abbreviations for DAKOTA methods and options.

DAKOTA Method                                                  Shorthand   # model evals
LHS, 180 samples                                               LHS 180     180
LHS, 1800 samples                                              LHS 1800    1800
VBD, 180 samples, 20 replicates                                VBD 180     3600
PCE, cubature level 3                                          C3          36
PCE, cubature level 5                                          C5          649
PCE, sparse grid level 2                                       SG2         721
PCE, sparse grid level 3                                       SG3         9841
Kriging, 180 true samples + 1000 surrogate samples             KRIG 180    180
Gaussian process, 180 true samples + 1000 surrogate samples    GP 180      180

Global  sensitivity  methods  in  DAKOTA  are  tractable  for  tens  of  variables.  The  relatively 
large  input  space  (7425  parameters)  defined  for  this  comparison  required  choosing  groups 
of  parameters  and  assigning  one  control  for  DAKOTA  to  manipulate.  First,  the  eighteen 
fissioning  isotopes  out  of  the  total  forty-five  isotopes  in  the  reactor  were  selected  as  the  focus 
of  this  comparison.  Then,  the  microscopic  fission  cross  sections  of  each  of  these  isotopes 
were  bundled  together  into  eighteen  independent  meta-parameters.  In  other  words,  for  each 
of the 18 parameters DAKOTA perturbed, a total of thirty-three microscopic cross section values
for  that  specific  isotope  were  perturbed  by  the  same  multiplier. 

A  perturbation  of  thirty-three  cross  sections  in  ensemble  compares  well  with  the  output 
of  the  adjoint-based  method  embedded  in  the  model  due  to  the  predominantly  linear  behavior 
of  the  system.  The  sensitivity  coefficient  for  a  group  of  cross  sections  is  simply  the  sum 
of  the  individual  cross  section  sensitivity  coefficients.  As  an  illustrative  example,  consider 
the  isotope  Plutonium-239  (PU239).  The  DAKOTA  control  parameter  will  perturb  all  thirty- 
three  energy  groups  of  the  fission  cross  section  of  PU239.  As  a  result,  a  relative  sensitivity 
coefficient for response $y_k$ with respect to this set of cross sections can be designated as
$S(y_k, \sigma_{PU239,FISSION,all})$. This sensitivity coefficient may be computed with

\[ S(y_k, \sigma_{PU239,FISSION,all}) = \sum_{g=1}^{G} S(y_k, \sigma_{PU239,FISSION,g}). \]
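
The additivity of group sensitivities can be checked numerically. The following is a minimal sketch for a purely hypothetical linear response with three energy groups (the study uses thirty-three); the weights, cross-section values, and function names are illustrative assumptions and do not come from the neutronics model. It compares the sum of per-group relative sensitivities with the relative sensitivity obtained by perturbing all groups by one common multiplier.

```python
def response(sigmas, weights, offset):
    """Hypothetical linear response y = offset + sum_g w_g * sigma_g."""
    return offset + sum(w * s for w, s in zip(weights, sigmas))


def group_relative_sensitivities(sigmas, weights, offset):
    """Per-group relative sensitivities (dy/dsigma_g) * (sigma_g / y)."""
    y = response(sigmas, weights, offset)
    return [w * s / y for w, s in zip(weights, sigmas)]


def ensemble_relative_sensitivity(sigmas, weights, offset, h=1e-3):
    """Relative sensitivity to a common multiplier c applied to all groups,
    estimated by a central difference about c = 1."""
    up = response([(1 + h) * s for s in sigmas], weights, offset)
    down = response([(1 - h) * s for s in sigmas], weights, offset)
    return (up - down) / (2 * h) / response(sigmas, weights, offset)


# Illustrative values only.
sigmas = [1.8, 1.5, 2.1]
weights = [0.2, 0.5, 0.1]
offset = 0.4

summed = sum(group_relative_sensitivities(sigmas, weights, offset))
ensemble = ensemble_relative_sensitivity(sigmas, weights, offset)
```

For a linear response the two quantities agree to rounding error; for a nonlinear response they would agree only to first order in the perturbation size.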

5.2.  Conversion  of  DAKOTA  Results.  Outputs  from  DAKOTA  and  the  neutronics 
model  may  be  compared  both  qualitatively  via  rankings  of  relative  importances  and  also 
quantitatively  via  relative  sensitivity  coefficients  introduced  in  equation  (2.5).  To  compare 
the  global  statistical  approaches  of  DAKOTA  to  the  local  sensitivities  of  the  adjoint  form 
of the model, the simple correlation, $\rho_{x_j,y_k}$, between the kth output response, $y_k$, and the jth
input parameter, $x_j$, output by DAKOTA must be transformed to a metric comparable to a
local gradient. DAKOTA's LHS-based methods calculate Pearson correlations in the form of
Equation (3.1). These are based on a regression in which the model is assumed linear. For
the neutronics model, the input that DAKOTA provides to the model during sampling is a
multiplier on the nominal value of the input parameter $x_j$ (relative variation) rather than the
absolute parameter value. In other words,

\[ \tilde{x}_j = c_j x_j, \]

the perturbed input parameter $\tilde{x}_j$ is derived from multiplying the nominal input parameter
value $x_j$ by a multiplier $c_j$, which has been chosen as $c_j \sim N(1, 0.03)$ to represent roughly
±10% variation. It follows that

\[ \frac{dy_k}{dc_j} = \frac{dy_k}{d\tilde{x}_j} \frac{d\tilde{x}_j}{dc_j} = \frac{dy_k}{d\tilde{x}_j} x_j. \]

As  the  number  of  samples  increases,  the  sample  means  converge  to  their  nominal  values  and 
the  relative  sensitivities  may  be  compared  as  follows: 


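\[ \frac{dy_k}{dx_j} \frac{x_j}{y_k} \approx \rho_{x_j,y_k} \frac{\sqrt{\mathrm{Var}(y_k)}}{\sqrt{\mathrm{Var}(c_j)}} \frac{1}{y_k}, \qquad (5.1) \]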


where the terms in the square roots are the standard deviation of output k over the standard
deviation of input multiplier j. Similarly, all PCE-based methods return the absolute local
sensitivities (based on the global polynomial approximation's derivative) directly and need
only be scaled by the inverse of the nominal output $y_k$. This is possible since the nominal
values of the $c_j$ multipliers are one.
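
The conversion from a sampled correlation to a relative sensitivity can be illustrated on a synthetic linear model. The sketch below is not the neutronics model: the response, the "true" relative sensitivities, the nominal output of 2.0, and the sample size are all assumed for illustration, and $N(1, 0.03)$ is read as a standard deviation of 0.03. Simple random sampling stands in for LHS. The estimate is the correlation times the ratio of output to input standard deviations, divided by the mean output.

```python
import random
from math import sqrt


def mean(v):
    return sum(v) / len(v)


def std(v):
    """Sample standard deviation."""
    m = mean(v)
    return sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))


def pearson(u, v):
    """Pearson (simple) correlation coefficient."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    return cov / sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))


def relative_sensitivity(c, y):
    """Correlation scaled by std(y)/std(c) and the inverse mean output."""
    return pearson(c, y) * std(y) / std(c) / mean(y)


rng = random.Random(0)
true_rs = [0.5, 0.1]   # assumed relative sensitivities of the toy model
n = 20000
# One list of multipliers c_j per input, each drawn around a nominal value of one.
c = [[rng.gauss(1.0, 0.03) for _ in range(n)] for _ in true_rs]
# Toy linear response with nominal output 2.0.
y = [2.0 * (1.0 + sum(s * (cj[i] - 1.0) for s, cj in zip(true_rs, c)))
     for i in range(n)]

estimates = [relative_sensitivity(cj, y) for cj in c]
```

With independent inputs and a linear response, the estimates approach the assumed relative sensitivities as the sample size grows, mirroring the convergence from LHS 180 to LHS 1800 described in the text.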

Relative  importances  may  be  measured  by  simple  correlations,  partial  correlations,  Sobol 
main  effect  indices  or  total  Sobol  indices.  Based  on  the  absolute  values  associated  with  these 
metrics,  a  larger  value  implies  stronger  correlation  and,  hence,  a  stronger  sensitivity.  Various 
DAKOTA  methods  may  be  compared  based  solely  on  the  rankings  of  these  metrics. 

5.3.  Sensitivity  analysis  results.  Results  here  focus  on  sensitivity  of  the  multiplication 
factor  k.  In  the  discussion,  the  result  from  the  neutronics  model  is  treated  as  a  reference  result 
because  it  provides  an  exact  (to  within  numerical  precision)  estimate  of  local  derivatives. 
The most common application of global SA is to rank the importance of input parameters;
however, it should also offer reasonable predictions of local first-order sensitivities for this
predominantly linear parameter-to-response mapping. Table 5.2 shows rankings based on
relative sensitivity coefficient, simple correlation, or Sobol index Si. Most methods, except
LHS 180 and VBD 180, which can probably be considered under-resolved, agree on the
rankings. Those that are under-resolved still get the approximate ordering correct. Taking 180
LHS samples and building a Kriging model helps with the ranking, as does directly using
1800 LHS samples.

Table 5.2
Comparison of isotope importance rankings as determined by various methods. LHS, Kriging and Gaussian
Process methods show simple correlations, whereas PCE/C3 and VBD show Sobol index-based rankings. Anomalies
are indicated in blue italics. All methods omitted from results produced the same rankings as the local adjoint and
LHS 1800 results.

ADJOINT      LHS 180    KRIG 180   GP 180     C3         VBD 180    LHS 1800
(relative)   (correl)   (correl)   (correl)   (Sobol)    (Sobol)    (correl)
PU239        PU239      PU239      PU239      PU239      PU239      PU239
PU240        PU241      PU240      PU240      PU240      PU238      PU240
PU241        PU240      PU241      PU241      PU241      PU242      PU241
AM242M       AM242M     AM242M     AM242M     AM242M     AM242M     AM242M
PU238        PU238      PU238      PU238      PU238      AM243      PU238
PU242        PU242      PU242      PU242      PU242      CM245      PU242
U238         CM245      U238       U238       U238       NP237      U238
CM245        U238       CM245      CM245      CM245      CM246      CM245
AM241        CM244      AM241      CM244      AM241      CM243      AM241
CM244        AM241      CM244      AM241      CM244      U236       CM244
NP237        AM243      NP237      NP237      NP237      CM242      NP237
AM243        U234       AM243      AM243      AM243      CM244      AM243
U235         NP237      U235       U235       U235       U234       U235
CM242        U236       CM242      CM242      CM242      U235       CM242
CM246        CM242      CM246      CM246      CM246      AM241      CM246
U234         CM246      U236       CM243      U234       U238       U234
CM243        CM243      CM243      U236       CM243      PU241      CM243
U236         U235       U234       U234       U236       PU240      U236

The  isotope  importances  given  in  Table  5.2  agree  reasonably  on  physical  grounds  based 
on  the  initial  nuclide  concentration  given  in  [7].  The  loading  for  this  reactor  is  predominantly 
Plutonium-239  and  Uranium-238.  These  two  isotopes  and  their  neutron  capture  daughter 
isotopes,  e.g.  Plutonium-240  or  Plutonium-241,  are  most  crucial  for  reactor  criticality. 

Figure  5.1  offers  a  comparison  of  the  relative  sensitivity  coefficients  and  the  comparable 
DAKOTA-based  measures  derived  in  Section  5.2.  There  is  significant  discrepancy  between 
the  adjoint-based  approach  and  LHS  180,  but  as  the  number  of  samples  increases  to  1800 
the differences nearly vanish. All other methods (some requiring far fewer computational
model evaluations) also capture these relative sensitivities. When 180 LHS samples are
used  to  construct  a  Kriging  model  (KRIG  180),  results  comparable  to  the  LHS  1800  case  are 
achieved. 

A somewhat arbitrary, but often useful, screening heuristic is that correlations greater
than 0.2 in magnitude can generally be considered relevant and those greater than 0.5
significant. The results for simple correlations shown in Figure 5.2 indicate a significant linear
relationship between a number of isotopes and response k. The partial correlation coefficients
shown in Figure 5.3 illustrate that controlling for the effects of other input factors may reveal
other significant factors; indeed, all but a handful are likely significant. The sample size has
little influence on these metrics, reflecting that for a linear analysis, few samples are needed
to capture the effects.

Fig. 5.1. Relative sensitivities (RS) estimated by local derivatives compared to scaled global estimates from
DAKOTA methods.

Sobol  indices  shown  in  Table  5.3  offer  another  measure  of  variable  importance.  They 
indicate a dominant effect of Pu239 and marginal effects of Pu240, Pu241, and Am242M.

Fig. 5.2. Simple correlation coefficients calculated using 180 and 1800 LHS samples.


Fig. 5.3. Partial correlation coefficients calculated using 180 and 1800 LHS samples.

Table 5.3
Sobol total effects Ti and main effects Si for each isotope as calculated with PCE, cubature level 5 (all other
stochastic expansion methods produced identical results).

Isotope   Ti      Si
AM241     0.001   0.001
AM242M    0.030   0.030
AM243     0.000   0.000
CM242     0.000   0.000
CM243     0.000   0.000
CM244     0.001   0.001
CM245     0.002   0.002
CM246     0.000   0.000
NP237     0.000   0.000
PU238     0.015   0.015
PU239     0.806   0.806
PU240     0.072   0.072
PU241     0.067   0.067
PU242     0.004   0.004
U234      0.000   0.000
U235      0.000   0.000
U236      0.000   0.000
U238      0.002   0.002

6. Conclusions. This exploratory comparison of DAKOTA and adjoint methods for
a linear neutronics model will serve as the foundation for future development of hybrid
global/local-adjoint methods to combat the curses of nonlinearity and dimensionality.
Reassuringly, the qualitative rankings and quantitative sensitivity coefficients from the
local adjoint and DAKOTA methods employed for this study converge as the number
of function evaluations increases. In as few as 36 function evaluations for PCE-based methods,
the Sobol indices and sensitivity coefficients were appropriately captured as compared with
the adjoint-based approach used in the neutronics model. Similarly, the surrogate-enhanced
Kriging model successfully captured almost all rankings and sensitivity coefficients within
1.3% with 180 function evaluations. These methods show promise in capturing main effects
while they are known to be less likely to misinform in the presence of nonlinearity (see [11],
pp. 10-12, 16-33, and references therein), though with a significant increase in computational
cost (10x to 100x due to required forward runs of the model). However, their scalability
in number of parameters is limited, so ongoing research is investigating the use of
adjoint-generated derivative information in global sensitivity approaches.

Meshing and Optimization

The articles in this section have an overarching theme of optimization. The first three
articles use optimization technology to improve the quality of meshes used in finite element
analysis. The remaining articles focus on practical implementation of various optimization
strategies.

Voshell et al. describe a mesh untangling algorithm using the target matrix paradigm for
quadratic elements. The authors present results using quadratic triangles, quadrilaterals,
tetrahedra, and hexahedra that show the algorithm is beneficial. Franks and Knupp present
a new metric for quadratic 2D mesh untangling. Using the target matrix paradigm, the authors
present numerical results that indicate comparable performance to previously developed
metrics. Park et al. consider the effect different vertex orderings have on the performance of a
local mesh smoothing algorithm. Timings varied by a factor of 2, with no clearly optimal
strategy. However, several strategies were seen as appropriate for an all-purpose ordering.
Berwald et al. summarize two projects. The first explores using tools from computational
topology to estimate the local density of points in an attractor. The second describes the
extension of the TEVA-SPOT software suite. Steele and Watson describe an implementation of
the Benders decomposition for large scale linear programming in Pyomo. Hunter et al.
consider the use of stochastic optimization for energy economy optimization models. Orsini and
Gray discuss an ACRO implementation of the EAGLS algorithm for hybrid optimization.

E.C.  Cyr 
S.S. Collis

QUADRATIC ELEMENT MESH UNTANGLING AND SHAPE OPTIMIZATION
VIA THE TARGET-MATRIX PARADIGM

NICHOLAS  VOSHELL*,  PATRICK  KNUPP',  AND  JASON  KRAFTCHECK* 

Abstract.  When  meshes  containing  high-order  elements  are  inverted,  it  is  often  due  to  the  projection  of  high- 
order  nodes  onto  the  domain  boundary.  Addressing  the  inversion  requires  either  re-meshing  or  post-processing  the 
mesh. The Target-matrix paradigm (TXP) for mesh optimization provides a framework we use to develop a
node-movement post-processing method that addresses meshes containing inverted quadratic elements. Our proposed
algorithm  includes  two  optimization  stages;  one  stage  uses  a  non-barrier  size  quality  metric  to  untangle  the  mesh, 
and  the  second  stage  uses  a  barrier  shape  quality  metric  to  improve  shape  quality  while  preventing  the  mesh  from 
inverting.  The  method  is  tested  on  non-hybrid  meshes  with  four  element  types:  triangles,  quadrilaterals,  tetrahedra, 
and  hexahedra.  Overall,  the  algorithm  proved  to  be  beneficial,  although  there  is  no  guarantee  that  the  optimized 
mesh  will  be  untangled. 


1.  Introduction.  Quadratic  finite  elements  are  sometimes  used  to  achieve  high  accuracy 
simulations.  Such  meshes  are  usually  created  by  first  creating  elements  with  straight  sides  and 
then  curving  boundary  sides  by  projecting  mid-side  and  mid-face  nodes  onto  the  bounding 
geometry  [15].  This  approach  creates  boundary-conforming,  quadratic  element  meshes  on 
complex  geometries  in  which  the  resulting  boundary  elements  can  be  poor  or  even  inverted, 
especially  if  elements  are  large  compared  to  the  associated  geometry  curvature.  The  poor 
quality  elements  can  impact  simulation  accuracy,  reduce  efficiency  and,  if  the  Jacobian  is 
negative,  even  prevent  proceeding  with  calculations. 

Finite  element  methods  are  based  on  a  local  map  from  a  logical  element  to  a  given  mesh 
element.  The  element  is  usually  considered  inverted  if  and  only  if  there  exists  some  point  p 
in  the  logical  element  such  that  det(J)  <  0,  where  J  is  the  Jacobian  matrix  of  the  map.  This 
test is relatively simple for linear elements because the determinant is linear in the logical
coordinates. Determining whether quadratic and other higher-order elements are inverted is
more difficult. Convergence theory for finite elements requires that the mesh contain no
inverted elements.
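
To make the inversion test concrete, the following sketch evaluates det(J) for a six-node quadratic triangle at the six node locations in logical coordinates. It assumes the standard quadratic Lagrange shape functions with the node ordering corners 0-2 followed by mid-side nodes 3 (edge 0-1), 4 (edge 1-2), and 5 (edge 2-0); the maps actually used in this work may be ordered differently, and the function names are illustrative. Sampling at a finite set of points can only detect inversion at those points, so a positive result everywhere is not a proof that the element is valid.

```python
def shape_gradients(xi, eta):
    """Derivatives of the six quadratic shape functions with respect to the
    logical coordinates (xi, eta), using barycentric l1, l2, l3."""
    l1, l2, l3 = 1.0 - xi - eta, xi, eta
    d_xi = [-(4 * l1 - 1), 4 * l2 - 1, 0.0, 4 * (l1 - l2), 4 * l3, -4 * l3]
    d_eta = [-(4 * l1 - 1), 0.0, 4 * l3 - 1, -4 * l2, 4 * l2, 4 * (l1 - l3)]
    return d_xi, d_eta


def det_jacobian(nodes, xi, eta):
    """det(J) of the logical-to-physical map at one logical point."""
    d_xi, d_eta = shape_gradients(xi, eta)
    x_xi = sum(d * p[0] for d, p in zip(d_xi, nodes))
    y_xi = sum(d * p[1] for d, p in zip(d_xi, nodes))
    x_eta = sum(d * p[0] for d, p in zip(d_eta, nodes))
    y_eta = sum(d * p[1] for d, p in zip(d_eta, nodes))
    return x_xi * y_eta - x_eta * y_xi


# Logical coordinates of the three corners and three mid-side nodes.
SAMPLE_POINTS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
                 (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]


def min_det(nodes):
    """Smallest sampled det(J); a negative value flags an inverted element."""
    return min(det_jacobian(nodes, xi, eta) for xi, eta in SAMPLE_POINTS)


# Straight-sided element: mid-side nodes at the edge midpoints.
straight = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
            (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
# Same element with the mid-side node of edge 0-1 pushed far into the interior,
# mimicking an aggressive projection onto curved geometry.
curved = list(straight)
curved[3] = (0.5, 0.6)

straight_min = min_det(straight)
curved_min = min_det(curved)
```

For the straight-sided element the map is affine and det(J) is constant, whereas moving a single mid-side node is enough to drive det(J) negative at a corner.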

There  have  been  limited  studies  into  untangling  or  improving  element  shape  qualities 
within meshes that include high-order elements. In [17, 18], the authors derive the regions
in which mid-side nodes can be placed to ensure that elements are non-inverted. Others,
in  [13,  14],  have  looked  into  untangling  and  improving  quality  of  high  order  meshes  using 
vertex  insertion  and  removal,  flipping  of  edges  and  faces,  along  with  other  topological  mesh 
modifications.  In  [16],  the  authors  investigate  extending  quality  metrics  to  assess  the  quality 
of  quadratic  elements.  Approaches  using  node-movement  based  on  optimization  appear  to 
be limited to [2, 3, 12]. An approach that extends linear quality metrics with an angle-based
penalty term was used in [2, 3]; the authors smoothed triangle and quadrilateral meshes, but the
approach is limited to isotropic two-dimensional meshes. This work continues and extends the work of
[12]. 

The  present  work  focuses  not  on  detecting  whether  an  element  is  inverted,  but  instead 
on  optimizing  shape  quality  such  that  inverted  elements  rarely  occur.  This  work  follows  a 
node  movement  strategy  called  the  Target-matrix  paradigm  [11].  Topological  techniques  are 
recognized  to  be  quite  powerful  and  it  is  hoped  that  they  can  eventually  be  combined  with 
effective  node  movement  strategies  such  as  TXP.  This  work  also  expands  on  recent  work  to 
add node-movement capabilities to the Mesquite mesh quality improvement library [1] and
involved adding support for the 27-point hexahedral element.

2. Mesh Optimization. Mesh optimization here refers to a technique for moving vertices
without changing the connectivity. The optimization uses an objective function to
quantitatively compare the quality of alternative meshes. Various constraints are also used, for
example limits on the positions of boundary vertices. The optimization then attempts to find
the mesh with the minimum objective function score that satisfies the constraints. Often the
objective function is a combination of individual component qualities; for example, in this
approach we select many sample points and combine the local quality metric scores for all
the sample points to create the objective function. As of this writing, most effort on mesh
optimization  has  focused  on  linear  elements.  This  has  produced  good  element  quality  metrics 
for  many  linear  elements. 

Currently,  Mesquite  supports  multiple  methods  to  constrain  vertex  positions.  Among 
them  are  (a)  fixed  (the  vertex  does  not  move),  (b)  free  (the  vertex  can  move  anywhere  in  the 
space),  and  (c)  constrained  to  some  geometry  (the  vertex  stays  within  some  lower  dimensional 
space,  such  as  a  plane,  curve,  or  point). 

Some  work  has  dealt  with  tangled  meshes  (meshes  containing  inverted  elements)  using 
mesh  optimization.  Multiple  mesh  untangling  algorithms  have  been  proposed  (see  [4,  5, 
6,  21]  and  the  contained  references).  However,  none  of  these  algorithms  have  theoretical 
guarantees  of  untangling.  Also,  focusing  only  on  untangling  means  that  the  results  often  have 
elements  of  poor  shape  quality.  This  issue  is  likely  to  arise  in  quadratic  element  meshes  often, 
since  curving  the  boundary  of  the  initial  linear  mesh  to  the  boundary  geometry  inverts  some 
elements  and  causes  others  to  have  poor  shape  quality.  The  approach  in  this  work  does  not 
use  such  untangling  approaches,  instead  using  a  TXP  approach  in  an  attempt  to  untangle  the 
mesh  while  addressing  other  shape  quality  issues. 

In the Target-matrix paradigm, each element of the mesh has a $C^1$ mapping from a logical
element to itself (the maps used in this work are described in appendix A). For an ideal target
element, the Jacobian matrix of this mapping is referred to as $W$. For an active element the
Jacobian matrix is denoted by $A$. From this one can get a Jacobian matrix for a
transformation from the target element to the active element, $T = AW^{-1}$. This Jacobian is evaluated
at some sample point, $k$, and is used as the input to a local quality metric which determines
the quality of the mesh at the sample point, $\mu_k = \mu(T_k)$. The local quality metric gives a
non-negative real number that represents how well $A$ represents $W$, often using properties such as
shape, size, and orientation.
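
A small numerical example may help fix the roles of $A$, $W$, and $T$. The sketch below uses a straight-sided triangle, for which the Jacobian of the logical-to-physical map is constant and its columns are the two edge vectors leaving the first vertex. The equilateral target, the scale factor, the rotation angle, and the helper names are illustrative assumptions rather than choices made in this work.

```python
from math import sqrt, cos, sin, radians


def edge_jacobian(p0, p1, p2):
    """2x2 Jacobian of the affine map; columns are the edges p1-p0 and p2-p0."""
    return [[p1[0] - p0[0], p2[0] - p0[0]],
            [p1[1] - p0[1], p2[1] - p0[1]]]


def inverse2(m):
    d = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Target: ideal equilateral triangle with unit edge length.
ideal = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]
W = edge_jacobian(*ideal)

# Active element: the ideal triangle scaled by s and rotated by theta.
s, theta = 2.0, radians(30)


def transform(p):
    return (s * (cos(theta) * p[0] - sin(theta) * p[1]),
            s * (sin(theta) * p[0] + cos(theta) * p[1]))


active = [transform(p) for p in ideal]
A = edge_jacobian(*active)

# T maps the target element onto the active element.
T = matmul2(A, inverse2(W))
```

Because the active element here is only a scaled and rotated copy of the target, $T$ comes out as the scaled rotation $sR$, which is the case in which a shape metric reports ideal quality.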

Local quality metrics, $\mu_k$, in the Target-matrix paradigm can be classified based on
whether they are barrier metrics or not. A barrier metric can be used with a mesh that is
not initially inverted to guarantee that the resulting optimal mesh will not be inverted. A
non-barrier metric should be used when the initial mesh is inverted since it allows elements to
transition between being inverted and non-inverted. If optimization with a non-barrier metric
creates a non-inverted mesh, the result can be further optimized using a barrier metric without
worry of the result being inverted.

The first metric of interest is the Shape metric in [7, 8]. It was shown that $\mu_S \ge 0$ for all $T$
and $\mu_S = 0$ if and only if $T$ is a scaled rotation matrix. In that case, $A$ has the form $A = sRW$
with arbitrary positive scalar $s$ and arbitrary rotation $R$. Then the shape of the active matrix is
the shape of the target matrix. The barrier form of the shape metric in 2D or 3D is:

\[ \mu_{SB,2D}(T) = \frac{\|T\|_F^2}{2\det(T)} - 1 \qquad (2.1) \]

\[ \mu_{SB,3D}(T) = \frac{\|T\|_F^3}{3\sqrt{3}\det(T)} - 1 \qquad (2.2) \]

Note that $\|\cdot\|_F$ refers to the Frobenius norm (the square root of the sum of the squares of all
elements in a matrix). The second metric of interest is the Size (Sz) metric, given in both 2D
and 3D by:


\[ \mu_{Sz}(T) = (\det(T) - 1)^2 \qquad (2.3) \]

This metric obeys $\mu_{Sz} \ge 0$ for all $T$ and $\mu_{Sz} = 0$ if and only if $\det(T) = 1$, i.e., $\det(A) =
\det(W)$. Thus, at the minimum, the local sizes of the active and target matrices agree. Because
this metric has no barrier, it can potentially untangle an inverted mesh as well as improve
relative size. It does not, however, encourage the shape of the active element to be close to
the shape of the target element. Thus, the target is chosen so that its primary use in the present
application is to encourage mesh untangling because $\det(W) > 0$.
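
The behavior of the two metrics can be seen on a few small matrices. The sketch below takes the 2D shape barrier metric to be $\|T\|_F^2 / (2\det(T)) - 1$ and the size metric to be $(\det(T) - 1)^2$. Returning infinity when $\det(T) \le 0$ is an assumption made here to represent the barrier in code, and the function names are illustrative rather than Mesquite's API.

```python
from math import inf


def det2(t):
    return t[0][0] * t[1][1] - t[0][1] * t[1][0]


def frobenius_sq(t):
    """Squared Frobenius norm: sum of squares of all entries."""
    return sum(v * v for row in t for v in row)


def shape_barrier_2d(t):
    """2D shape barrier metric; zero exactly for scaled rotations."""
    d = det2(t)
    if d <= 0.0:
        return inf  # assumption: inverted or degenerate samples hit the barrier
    return frobenius_sq(t) / (2.0 * d) - 1.0


def size_metric(t):
    """Non-barrier size metric; zero exactly when det(T) = 1."""
    return (det2(t) - 1.0) ** 2


scaled_rotation = [[0.0, -3.0], [3.0, 0.0]]   # 90 degree rotation scaled by 3
shear = [[1.0, 1.0], [0.0, 1.0]]              # unit area, distorted shape
reflection = [[1.0, 0.0], [0.0, -1.0]]        # inverted (det = -1)

shape_values = [shape_barrier_2d(t) for t in (scaled_rotation, shear, reflection)]
size_values = [size_metric(t) for t in (scaled_rotation, shear, reflection)]
```

The scaled rotation has ideal shape but the wrong size, the shear has the right size but poor shape, and the reflection is finite under the size metric while the shape barrier rejects it, which is why the non-barrier size metric is the one usable on a tangled mesh.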

In [9], many issues in measuring sample point quality are investigated. A label
invariance property is also discussed, which is important since there are many possible mappings
from a logical element to another element based on the correspondence between logical and
physical vertices. Conditions are derived for the metric formulation, sample point selection,
and target element selection to guarantee label invariance of the local quality metric. It was
shown that the metrics/element types used in the present study are label invariant.

The selection of sample point locations is an important issue and can affect the label
invariance of the metric and the ability to detect inverted elements. Since the current study
does not attempt to provide robust untangling, it is the label invariance that dominates this
selection. Specifically, we use sample points at all the vertex positions (corner, mid-side,
mid-face, mid-element, etc.) that are used to define the element for all elements.¹ Further
experiments into robust untangling are left for future work.

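The local qualities at all the sample points, $k$, are combined into an objective function
to be used in our numerical mesh optimization. Specifically, we use a power-mean, giving
the following objective function:

\[ F = \frac{1}{N} \sum_{\text{sample point } k} \mu(T_k)^p \qquad (2.4) \]

where $N$ is the number of sample points and $p$ is the power-mean exponent.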
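
As a minimal sketch of how sample-point qualities are aggregated, the function below assumes a power-mean objective of the form $F = \frac{1}{N} \sum_k \mu(T_k)^p$, without a final root. Both this form and the exponent values used in the example are assumptions for illustration; the exponent used in this work is not stated in the text.

```python
def power_mean_objective(mu_values, p=1.0):
    """Average of the p-th powers of the local quality metric values.

    mu_values holds one non-negative metric value per sample point;
    larger p weights the worst sample points more heavily.
    """
    return sum(mu ** p for mu in mu_values) / len(mu_values)


# Hypothetical metric values at three sample points.
local_qualities = [0.0, 0.5, 1.5]
f_linear = power_mean_objective(local_qualities, p=1.0)
f_squared = power_mean_objective(local_qualities, p=2.0)
```

Raising the exponent shifts the objective toward the worst sample point, which matters when a few poor boundary elements dominate an otherwise good mesh.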


3. Algorithm. This work extends the work of the previous summer [12], in which a
two-stage algorithm was found to be effective for quadratic triangular elements: the first stage
untangles the mesh using a size metric, and the second stage maintains the untangled
property of the mesh while improving shape quality by using a shape barrier metric. We extend
the algorithm to scale the target matrices and to work with quadratic tetrahedral,
quadrilateral, and hexahedral elements. This gives the algorithm depicted in figure 3.1. We
motivate the choices made in this algorithm, then give the proposed algorithm and present
pseudocode.

The algorithm is currently structured to properly handle cases of non-inverted meshes,
inverted meshes, and uninvertible meshes. The area optimization is the primary untangling
step, and is protected by checks from running on non-inverted meshes to avoid unnecessary
computation and because it has been found to tangle meshes with element size heterogeneity.
The shape optimization was added because the untangling stage often produces meshes of
poor shape quality. It is protected by checks to prevent operating on tangled meshes and
is also run on input non-inverted meshes of poor quality in an attempt to improve quality.
This differs from the previous structure of two stages, one after the other, in two ways. First,
an initially non-inverted mesh does not go through the area optimization (which may potentially
invert the mesh). Second, a mesh that does not untangle in the area optimization avoids the shape
optimization, whereas before it would go through the shape optimization enough to produce
an error.

¹ Note that a mesh vertex can have multiple associated sample points.

Fig. 3.1. Quadratic element untangling and shape improvement algorithm. This is a flow depiction of the
algorithm; note there are two branches. The first branch is decided based on whether the mesh is inverted (top is
inverted). The second branch is decided based on whether the mesh is poor quality and not inverted (taking the top,
otherwise taking the bottom).

The target matrices, $W$, used in the optimization steps of the proposed algorithm are
intended to target the solvers towards ideal elements. Here, we use the notation from the LVQD
matrix decomposition [10]. In this case $Q_{ideal}$ corresponds to the Jacobian matrix for an ideal
element of standard size, skew, and orientation. This suffices for the shape optimization since
it is using a shape barrier metric that is sensitive only to the shape of the elements. The size
optimization, however, also uses $\Lambda$ so that the target matrix corresponds to an element of
appropriate size. Specifically, $\Lambda$ is the mean over all the sample points of the $\Lambda$ size values from
the LVQD decomposition of the Jacobian matrix of the mapping at that sample point. This
setting of $\Lambda$ corresponds well to experiments (not presented due to space) that confirmed that
a setting significantly larger or smaller tended to hinder the ability of the area optimization to
untangle the mesh.


Fig. 3.2. Boundary Tangling. Both figures depict a mesh for a cube with a hemisphere cut out. Left is the
generated mesh of linear tetrahedra, whereas the right figure depicts the same topology with quadratic tetrahedra. Note
that  the  projection  of  the  mid-side  nodes  around  the  edge  of  the  hemispherical  cut  causes  the  triangular  boundary  of 
one  tetrahedron  to  become  tangled.  This  cannot  be  untangled  without  moving  vertices  on  the  boundary  of  the  mesh. 
Coloring  is  based  on  element  ID  since  ParaView  draws  extra  edges  to  connect  mid-nodes  to  the  rest  of  the  mesh. 


As has been described before, high-order meshes tend to have problems on the
boundaries. This can cause some trouble if the boundaries are held fixed. One potential problem is
depicted in figure 3.2, in which converting the linear mesh to a quadratic one causes a
boundary triangle of at least one tetrahedron to become inverted. Thus fixing boundary vertices will
mean that this element cannot be untangled by the area optimization. This problem is why
the boundary geometric constraints were added to this algorithm: the inverted boundary
triangle can now be untangled by moving the vertices within their relevant boundary
geometries.

To be precise, the algorithm starts out by loading the mesh and determining, for each
vertex, which geometry the vertex belongs to. This is used to constrain boundary vertices to the
appropriate boundary geometry (e.g., plane, cylindrical surface, line, circle, point). The
second step is to detect whether the mesh contains inverted or poor quality elements (or neither).
This is accomplished by looping over all the sample points and detecting the minimal $\det(T)$
value, $\tau_{min}$, and the maximal $\mu_{SB}$ value, $(\mu_{SB})_{max}$. The third step then applies only if some
inverted sample point is found (i.e., if $\tau_{min} < 0$). In the third step, an area optimization attempts
to untangle the mesh. Step 3a constructs the target matrices; in this case each corresponds to an
ideally shaped element of size $\Lambda$ (which is the average of the sizes detected at all the sample
points). The area optimization (step 3b) then runs to convergence (400 iterations is more than
enough for the meshes investigated) using the geometric constraints, target matrices, and the
size metric. Afterward it is unknown whether the area optimization was successful, thus step 3c
detects if the mesh is inverted or has poor quality. Step four then operates to improve the quality
of the mesh without inverting it, thus it should only be run if the mesh has poor quality elements
and is not inverted (i.e., $\tau_{min} > 0$ and $(\mu_{SB})_{max} > \mu_{cutoff}$). In our experiments, we use
$\mu_{cutoff} = 2$, but other values could be used. Step four starts by defining the target matrices
(step 4a) based on an ideal element of unit size; it does not use $\Lambda$ since $\mu_{SB}$ is size invariant.
The shape optimization procedure (step 4b) is then run to convergence using the
target matrices and geometric constraints previously defined. This shape optimization uses
the shape barrier metric in order to prevent the mesh from inverting.

Quadratic Element Untangling and Shape Improvement Algorithm

1. Load mesh and determine geometric constraints.

2. Compute minimum area, $\tau_{min}$, and worst quality, $(\mu_{SB})_{max}$.

3. If $\tau_{min} < 0$:

   (a) Construct Target Matrices, $W$:

       i. Compute size parameter, $\Lambda$.

       ii. $W = \Lambda Q_{ideal}$ at all sample points.

   (b) Area Optimization:

       i. Boundaries: Constrained to geometry.

       ii. Metric: $\mu_{SZ}(T)$ (Size metric, no barrier).

       iii. Termination Criterion: 1 iteration outer, 400 inner (converged).

       iv. Solver: SteepestDescent.

   (c) Compute $\tau_{min}$ and $(\mu_{SB})_{max}$.

4. If $\tau_{min} > 0$ AND $(\mu_{SB})_{max} \ge \mu_{cutoff}$:

   (a) Construct Target Matrices, $W = Q_{ideal}$ at all sample points.

   (b) Shape Optimization:

       i. Boundaries: Constrained to geometry.

       ii. Metric: $\mu_{SB}(T)$ (the Shape barrier metric).

       iii. Termination Criterion: 1 iteration outer, 400 inner (converged).

       iv. Solver: SteepestDescent.
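The control flow of steps 2 through 4 can be summarized in a short sketch. The three callables `measure`, `area_optimize`, and `shape_optimize` are hypothetical stand-ins for the mesh measurement and the two optimization stages; they are not functions of any real library, and step 1 (loading the mesh and assigning geometry) is assumed to have been done by the caller.

```python
# Sketch of the two-stage control flow only; the callables are hypothetical.

def untangle_and_improve(mesh, measure, area_optimize, shape_optimize, mu_cutoff=2.0):
    """Run the area stage only on tangled meshes, the shape stage only on valid ones.

    measure(mesh) is assumed to return (tau_min, mu_sb_max) over all sample points.
    """
    tau_min, mu_sb_max = measure(mesh)              # step 2
    if tau_min < 0.0:                               # step 3: mesh is tangled
        mesh = area_optimize(mesh)                  # steps 3a and 3b
        tau_min, mu_sb_max = measure(mesh)          # step 3c: did untangling succeed?
    if tau_min > 0.0 and mu_sb_max >= mu_cutoff:    # step 4: valid but poor quality
        mesh = shape_optimize(mesh)                 # steps 4a and 4b
    return mesh


if __name__ == "__main__":
    # Toy stand-ins: the "mesh" is just a label describing its state.
    quality = {"tangled": (-0.5, 9.0), "untangled_poor": (0.1, 5.0), "good": (0.8, 1.2)}
    result = untangle_and_improve(
        "tangled",
        measure=lambda m: quality[m],
        area_optimize=lambda m: "untangled_poor",
        shape_optimize=lambda m: "good",
    )
    print(result)
```

Because the barrier metric of step 4 is only meaningful on a valid mesh, the sketch skips the shape stage whenever the area stage leaves $\tau_{min} \le 0$.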

In theory, the boundary vertices should be constrained to the appropriate geometric entity
(corner, curve, or surface). This is implemented by setting all the vertices to free and placing
the boundary vertices under the ownership of the relevant surface geometric entity in step 1.
This ownership is respected in the two optimization stages: the solver optimizes vertex positions
using unconstrained optimization techniques, and then all vertices are snapped back to their
owning geometry.
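A minimal 2D sketch of this "move freely, then snap back" treatment is given below. The two snapping functions (closest point on a circle and on an infinite line) and the helper `constrained_update` are illustrative assumptions; a real implementation would query the geometry engine that owns each vertex.

```python
import math

# Sketch: an unconstrained update followed by projection onto the owning geometry.
# The geometry types and function names are hypothetical, chosen for illustration.

def snap_to_circle(p, center, radius):
    """Closest point to p on the circle with the given center and radius."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        # The center is equidistant from the whole circle; pick an arbitrary point.
        return (center[0] + radius, center[1])
    return (center[0] + radius * dx / dist, center[1] + radius * dy / dist)


def snap_to_line(p, a, b):
    """Closest point to p on the infinite line through a and b."""
    ex, ey = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * ex + (p[1] - a[1]) * ey) / (ex * ex + ey * ey)
    return (a[0] + t * ex, a[1] + t * ey)


def constrained_update(p, step, snap):
    """Apply an unconstrained step, then return the vertex to its owning geometry."""
    moved = (p[0] + step[0], p[1] + step[1])
    return snap(moved)


if __name__ == "__main__":
    on_arc = constrained_update((1.0, 0.0), (0.0, 0.5),
                                lambda q: snap_to_circle(q, (0.0, 0.0), 1.0))
    on_edge = constrained_update((0.3, 0.0), (0.1, 0.7),
                                 lambda q: snap_to_line(q, (0.0, 0.0), (1.0, 0.0)))
    print(on_arc, on_edge)
```

The vertex is free to slide along its owning curve, which is what allows a tangled boundary element to be repaired, but it cannot leave the geometry.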

4. Numerical Examples. In this section we present results of this algorithm applied
to various meshes composed of four types of quadratic elements: triangles, quadrilaterals,
tetrahedra, or hexahedra. Details of these different element types are presented in appendix
A. Most examples were created using Cubit [19] or Triangle [20] by first creating a mesh
of linear elements and then converting it to quadratic elements and projecting boundary
mid-side and mid-face nodes to the geometry. Most of the presented examples are cases in which
this procedure created tangled meshes.


Fig. 4.1. Smoothing the triangular Part mesh. Input is left, output of the area optimization is center, and
algorithm output is right. Note that the area optimization produces poorly shaped elements in an untangled mesh.


Fig. 4.2. Smoothing the unperturbed Homogeneous mesh. Initial inverted elements are at the top of the two cuts
into the lower edge and at the bottom of the rightmost cut into the top. Input is left, output of the area optimization
is center, and algorithm output is right.

For  the  2D  meshes,  we  present  a  quadrilateral  result  (figure  4.3)  and  two  triangle  results 
(figures 4.1 and 4.2). Figure 4.2 was created using Triangle, and the rest were created using
Cubit.  All  three  of  the  2D  figures  (4.1,  4.2,  and  4.3)  were  created  such  that  the  conversion  to 
quadratic  elements  tangled  the  mesh. 


Fig.  4.3.  Smoothing  the  quadrilateral  Part  mesh.  Input  is  left,  output  of  the  area  optimization  is  center,  and 
algorithm  output  is  right.  Note  that  some  interior  edges  are  needlessly  curved  by  the  area  optimization. 


One can see from figures 4.1, 4.2, and 4.3 that the proposed algorithm is capable of
untangling and smoothing some 2D quadratic elements (both tri6 and quad9). One can also see
that some interior regions far from the tangled elements are significantly affected by the
area optimization, and that many of these elements are moved back to nearly their initial
position by the shape optimization. This computational inefficiency is one issue we hope to
address in future work.

The 3D meshes depicted in figures 4.4, 4.5, and 4.6 were all created with Cubit.
However, the hex27 mesh (figure 4.6) was not tangled after creation, so selected boundary
vertices were manually moved to create the tangled elements on the boundary of the mesh.

The figures show that the proposed algorithm is capable of untangling various 3D tangled
meshes. Because 3D meshes are hard to visualize, it is difficult to assess how much some
interior elements may be needlessly changed by the area optimization.


Fig. 4.4. Untangling and Smoothing Tetrahedral Cut Cube Mesh. The mesh is tangled as generated by Cubit,
without any perturbations. The upper row of the figure has the input and the bottom row has the output. The leftmost
figures show the elements depicted in different colors. The center figures show the inverted elements (elements
having a sample point where $\det(T) < 0$) along with a wireframe of the entire mesh. The rightmost figures show
the poor quality elements (elements having a sample point where $\mu_{SB}(T) > 2$) along with a wireframe of the
entire mesh. Note that the algorithm untangles the mesh and reduces the number of poor quality elements.


Fig. 4.5. Untangling and Smoothing Tetrahedral Cut Cylinder Mesh. This figure shows the results of the
proposed algorithm on a mesh created by taking a spherical cut out of a cylinder with a refined top. The idea for this
is based on some untangling present in the SLAC mesh. Furthermore, the mesh is tangled as generated by Cubit,
without any perturbations. The left three figures depict the input and the right three figures depict the output. The
leftmost figures of each side show the elements depicted in different colors. The center figures of each side show the
inverted elements (elements having a sample point where $\det(T) < 0$) along with a wireframe of the entire mesh. The
rightmost figures of each side show the poor quality elements (elements having a sample point where
$\mu_{SB}(T) > 2$) along with a wireframe of the entire mesh. Note that the algorithm untangles the mesh and reduces
the number of poor quality elements.


5. Conclusions and Future Work. We have proposed an algorithm, extending our
previous work [12], to untangle and improve the shape quality of meshes containing quadratic
elements. Various aspects of this algorithm were motivated, and results were shown with
6 point triangular elements, 9 point quadrilateral elements, 10 point tetrahedral elements, and
27 point hexahedral elements. Plans for future work include incorporating other, more
robust untangling methods into the algorithm. We also plan to look into methods to improve
the efficiency of the algorithm; specifically, we plan to investigate constraining some of the
interior vertices, in a manner that doesn't hinder the untangling, to prevent them from moving
needlessly.


Fig. 4.6. Untangling and Smoothing Hexahedra. In this figure, we show the results of untangling and smoothing
the 3Dpart Hex27 mesh that has been generated by paving one side, sweeping through the volume and perturbing
some of the corner vertices on the boundary of the mesh. The upper row of the figure has the input and the bottom
row has the output. The leftmost figures show the elements depicted in different colors. The center figures show the
inverted elements (elements having a sample point where $\det(T) < 0$) along with a wireframe of the entire mesh. The
rightmost figures show the poor quality elements (elements having a sample point where $\mu_{SB}(T) > 2$) along
with a wireframe of the entire mesh. Note that the algorithm untangles the mesh; it also improves the quality of the
poorest quality elements, though that is difficult to see.


Appendix


A. Quadratic Element Maps and Target Matrices. The four shapes focused on in this
work are depicted in figure A.1, which shows the location and ordering of the points. Using
the standard basis vectors for $\mathbb{R}^2$ one can define a logical triangle which serves as the
master element for the transformations used by quadratic elements.


Fig. A.1. Element Shapes. Depicted are the 6 point triangle (left-most), 9 point quadrilateral (center left), 10
point tetrahedron (center right), and 27 point hexahedron (right-most).


Based on the vertex ordering depicted in figure A.1, the quadratic triangle has the following
mapping:

$$x_{tri6}(r, s) = \sum_{i=0}^{n-1} N_i(r, s)\, x_i \qquad (A.1)$$

$$u = 1 - r - s \qquad (A.2)$$

$$N_0 = u(2u - 1)$$

The 10 point tetrahedral mapping is as follows:

$$x_{tet10}(r, s, t) = \sum_{i=1}^{n} N_i(r, s, t)\, x_i \qquad (A.9)$$

$$u = 1 - r - s - t$$

$$\begin{aligned}
N_1 &= u(2u - 1), & N_4 &= t(2t - 1), & N_8 &= 4ut, \\
N_2 &= r(2r - 1), & N_5 &= 4ru, & N_9 &= 4rt, \\
N_3 &= s(2s - 1), & N_6 &= 4rs, & N_{10} &= 4st, \\
 & & N_7 &= 4su &&
\end{aligned} \qquad (A.10)$$
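As a sanity check on the ten tetrahedral shape functions, the following sketch verifies the two properties any such nodal basis should have: each function equals one at its own node and zero at the others, and the functions sum to one everywhere. The node coordinates listed in `TET10_NODES` are an assumption inferred from the form of the functions (corners first, then edge midpoints); the paper's figure A.1 defines the actual ordering.

```python
# Sketch: checking the Kronecker-delta and partition-of-unity properties of
# ten-point tetrahedral shape functions. The node locations are an ASSUMPTION.

def tet10_shape(r, s, t):
    """Values of the ten shape functions at logical point (r, s, t)."""
    u = 1.0 - r - s - t
    return [
        u * (2.0 * u - 1.0), r * (2.0 * r - 1.0),
        s * (2.0 * s - 1.0), t * (2.0 * t - 1.0),
        4.0 * r * u, 4.0 * r * s, 4.0 * s * u,
        4.0 * u * t, 4.0 * r * t, 4.0 * s * t,
    ]


# Logical coordinates at which each function is expected to equal one.
TET10_NODES = [
    (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (0.5, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.5, 0.0),
    (0.0, 0.0, 0.5), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5),
]


def is_nodal_basis(shape, nodes, tol=1e-12):
    """True if shape function i is 1 at node i and 0 at every other node."""
    for j, node in enumerate(nodes):
        values = shape(*node)
        for i, value in enumerate(values):
            expected = 1.0 if i == j else 0.0
            if abs(value - expected) > tol:
                return False
    return True


if __name__ == "__main__":
    print(is_nodal_basis(tet10_shape, TET10_NODES))
    print(sum(tet10_shape(0.2, 0.3, 0.1)))
```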


The  mapping  and  target  matrices  are  different  for  the  other  element  types.  The  Quad9 
(nine  point  quadrilateral)  and  Hex27  (twenty  seven  point  hexahedron)  elements  are  closely 
related  and  can  be  expressed  using  the  following  basic  approaches: 


$$x_{quad9}(r, s) = \sum_{i=0}^{n-1} N_i(r, s)\, x_i \qquad (A.11)$$

$$\begin{aligned}
N_0 &= (r - 1)(2r - 1)(s - 1)(2s - 1), & N_5 &= r(2r - 1)\, 4s(1 - s), \\
N_1 &= r(2r - 1)(s - 1)(2s - 1), & N_6 &= 4r(1 - r)\, s(2s - 1), \\
N_2 &= r(2r - 1)\, s(2s - 1), & N_7 &= (r - 1)(2r - 1)\, 4s(1 - s), \\
N_3 &= (r - 1)(2r - 1)\, s(2s - 1), & N_8 &= 4r(1 - r)\, 4s(1 - s), \\
N_4 &= 4r(1 - r)(s - 1)(2s - 1) &&
\end{aligned} \qquad (A.12)$$
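The nine quadrilateral functions are tensor products of three one-dimensional quadratic polynomials on the unit interval. The sketch below makes that structure explicit and checks the nodal property. The names `l1`, `l2`, `l3` and the node coordinates in `QUAD9_NODES` are labels chosen for this illustration, read off from the factors in (A.12).

```python
# Sketch: quad9 shape functions as products of 1D quadratic Lagrange polynomials.
# The helper names and the node table are illustrative assumptions.

def l1(x):
    """Equals 1 at x = 0 and 0 at x = 1/2 and x = 1."""
    return (x - 1.0) * (2.0 * x - 1.0)


def l2(x):
    """Equals 1 at x = 1/2 and 0 at x = 0 and x = 1."""
    return 4.0 * x * (1.0 - x)


def l3(x):
    """Equals 1 at x = 1 and 0 at x = 0 and x = 1/2."""
    return x * (2.0 * x - 1.0)


# Factor pair (function of r, function of s) for N_0 .. N_8, following (A.12).
QUAD9_FACTORS = [(l1, l1), (l3, l1), (l3, l3), (l1, l3),
                 (l2, l1), (l3, l2), (l2, l3), (l1, l2), (l2, l2)]

# Logical coordinates at which each product equals one.
QUAD9_NODES = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0),
               (0.5, 0.0), (1.0, 0.5), (0.5, 1.0), (0.0, 0.5), (0.5, 0.5)]


def quad9_shape(r, s):
    """Values of the nine shape functions at logical point (r, s)."""
    return [f(r) * g(s) for f, g in QUAD9_FACTORS]


if __name__ == "__main__":
    for j, (r, s) in enumerate(QUAD9_NODES):
        values = quad9_shape(r, s)
        assert all(abs(v - (1.0 if i == j else 0.0)) < 1e-12 for i, v in enumerate(values))
    print(sum(quad9_shape(0.3, 0.7)))
```

The same three polynomials, applied in three directions, generate the 27 hexahedral functions.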


$$x_{hex27}(r, s, t) = \sum_{i=0}^{n-1} N_i(r, s, t)\, x_i \qquad (A.13)$$

Each of the 27 shape functions is a product of one-dimensional quadratic functions,
$N_i = l_a(r)\, l_b(s)\, l_c(t)$ with $a, b, c \in \{1, 2, 3\}$ selected according to the position of
node $i$ in figure A.1; for example, $N_0 = l_1(r)\, l_1(s)\, l_1(t)$. The one-dimensional functions are

$$l_1(\xi) = (\xi - 1)(2\xi - 1), \qquad l_2(\xi) = 4\xi(1 - \xi), \qquad l_3(\xi) = \xi(2\xi - 1). \qquad (A.14)$$

For  an  isotropic  domain,  there  is  no  reason  for  the  target  element  to  break  symmetry,  so 
the  equilateral  triangle  is  a  good  choice  for  the  target.  For  theoretical  reasons  explained  in 
[10],  we  choose  the  area  of  the  target  equilateral  triangle  to  be  one.  Then  the  Jacobian  of  the 
logical  to  target  mapping  is  as  follows: 


$W_{ideal,tri6}$ = [matrix entries not recoverable] (A.15)


The tetrahedral ideal element also corresponds to an equilateral element, and has the
following $W_{ideal}$:

For the quadrilateral and hexahedral elements, the $W_{ideal}$ from equilateral elements of
unit size is an identity matrix (2x2 for quadrilateral and 3x3 for hexahedral).

$$W_{ideal,quad9} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \qquad (A.17)$$

$$W_{ideal,hex27} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (A.18)$$


A NEW STRATEGY FOR UNTANGLING 2D MESHES VIA NODE-MOVEMENT

JASON  W.  FRANKS ^  AND  PATRICK  M.  KNUPP* 

Abstract. A new mesh optimization strategy for untangling quadrilateral meshes, based on node-movement, is
investigated. The strategy relies on a set of Propositions which show that, for certain quality metrics
$0 < \mu < \infty$ within the Target-matrix paradigm, if $\mu < 1$ then the local area is positive. The
Propositions are exploited in devising a new strategy for simultaneous mesh untangling and quality improvement.
Numerical results confirm the expected behavior of the new strategy.

1. Introduction. A common problem in mesh generation is the creation of meshes having
inverted, invalid, or tangled elements. In general, such meshes cannot be used in computer
simulations. Many methods for repairing or untangling inverted meshes exist, including
re-meshing, local mesh modification, and mesh optimization and smoothing via node-movement
techniques. Re-meshing tends to be non-automatic. Local mesh modification can be very
effective but, for the most part, is limited to simplicial meshes. Mesh optimization can address
meshes containing a wider variety of element types, but current methods do not guarantee
that the optimized mesh will be untangled. While we recognize the value of re-meshing and
local mesh modification methods in mesh untangling, in this paper we set them aside in order
to focus on node-movement methods.

There  exists  a  wide  variety  of  node  movement  methods  and  the  majority  of  them  do  not 
specifically  address  the  mesh  untangling  problem.  For  example,  Laplace  smoothing  is  widely 
used to create 'smooth' meshes consisting of well-shaped elements. The fact that Laplace
smoothing  can  sometimes  untangle  an  inverted  mesh  is  somewhat  incidental  in  that  it  is  not 
specifically  designed  to  do  so.  In  fact,  it  is  well  known  that  Laplace  smoothing  can  create  a 
tangled  mesh  from  one  that  initially  is  untangled.  Many  other  smoothing  and  optimization 
methods  behave  similarly.  The  common  approach  to  avoiding  tangled  meshes  has  been  to 
devise  node-movement  strategies  with  ’invertibility  guarantees’,  i.e.,  they  guarantee  that  the 
result  of  the  node-movement  will  be  a  non-inverted  mesh.  Winslow  smoothing  is  perhaps  the 
first  node-movement  method  which  provided  such  a  guarantee  [11].  However,  the  guarantee 
is  limited  to  2D  meshes  only.  Moreover,  the  method  has  proved  somewhat  difficult  to  extend 
to  unstructured  meshes. 

Winslow  smoothing  is  based  on  the  numerical  solution  of  a  partial  differential  equation. 
Other  node-movement  strategies  are  based  on  numerical  optimization  of  objective  functions 
which  are  defined  in  terms  of  discrete  geometric  entities  such  as  edge-lengths,  angles,  areas, 
and  volumes.  These  methods  are  more  easily  applied  to  unstructured  meshes  and,  moreover, 
provide node-movement methods similar to Winslow. The main example would be the discrete
objective function based on a summation of the squares of edge-lengths, divided by the
local  area  or  volume: 


(1.1) 


These methods belong to a set of methods known as barrier methods. In this example,
elements having volumes close to zero are penalized because the objective function tends to
increase rapidly as the volume approaches zero. If all the elements of the initial mesh have
positive volume, then optimization of this objective function will (when properly implemented)
guarantee that the minimizing mesh is untangled. A drawback to the barrier methods, however,
is that they cannot be used when the initial mesh is tangled (because there is no way to
cross the barrier to the non-inverted region).
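The barrier behavior can be seen numerically on a single triangle. The sketch below evaluates one plausible instance of such an objective, the sum of squared edge lengths divided by the signed area, as the triangle is flattened. The exact form and normalization of the objective in the text are not reproduced here, so the function `edge_length_barrier` should be read as an illustrative assumption.

```python
# Sketch: a barrier-type quantity for one triangle (sum of squared edge lengths
# divided by signed area). This is an illustrative stand-in, not the paper's (1.1).

def signed_area(a, b, c):
    """Signed area of triangle abc; positive for counter-clockwise ordering."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))


def edge_length_barrier(a, b, c):
    """Sum of squared edge lengths over signed area; blows up as the area tends to 0."""
    squared = sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                  for p, q in ((a, b), (b, c), (c, a)))
    return squared / signed_area(a, b, c)


if __name__ == "__main__":
    base_left, base_right = (0.0, 0.0), (1.0, 0.0)
    for height in (1.0, 0.1, 0.01, -0.01):
        apex = (0.5, height)
        print(height, edge_length_barrier(base_left, base_right, apex))
```

As the apex approaches the base from above the value grows without bound, and once the apex crosses the base the sign flips. A descent method started on one side of this singularity therefore has no path to the other side, which is the limitation described above.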

Two  approaches  have  been  proposed  to  overcome  the  limitation  that  barrier  methods 
cannot  be  applied  to  tangled  initial  meshes.  In  the  first,  the  barrier  is  moved  by  replacing  V  in 
the  objective  function  by  V  -  Vmin,  where  Vmin  is  the  minimum  volume  over  the  initial  mesh. 
A  homotopy  method  is  applied  to  gradually  increase  the  minimum  volume  until  it  is  positive 
(see  [1]  for  details).  While  this  method  is  often  effective,  it  is  relatively  expensive  and  does 
not  guarantee  that  the  final  result  will  be  non-inverted.  In  the  second  approach,  the  barrier  is 
removed  entirely,  with  V  being  replaced  by  a  blending  function  that  is  approximately  equal 
to $(V + |V|)/2$. As $V$ goes from positive to negative, the blending function transitions from
V  to  zero,  so  that  elements  having  negative  areas  are  heavily  penalized  (see  [2]  for  details). 
While  effective,  the  method  does  not  provide  an  invertibility  guarantee. 
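To make the second approach concrete, the sketch below compares the exact expression $(V + |V|)/2$ with one common smooth approximation, $(V + \sqrt{V^2 + \delta^2})/2$. The particular smoothing and the parameter `delta` are assumptions chosen for illustration; the blending function actually used in [2] may differ in detail.

```python
import math

# Sketch: a smooth stand-in for (V + |V|)/2. The smoothing form and the value of
# delta are ASSUMPTIONS for illustration only.

def ramp(v):
    """Exact (V + |V|)/2: equals V for V >= 0 and 0 for V < 0."""
    return 0.5 * (v + abs(v))


def blend(v, delta=1e-3):
    """Smooth, strictly positive approximation of ramp(v)."""
    return 0.5 * (v + math.sqrt(v * v + delta * delta))


if __name__ == "__main__":
    for v in (1.0, 0.1, 0.0, -0.1, -1.0):
        # 1/blend(v) stays finite but becomes very large for negative v.
        print(v, ramp(v), blend(v), 1.0 / blend(v))
```

Because the approximation never reaches zero, an objective that divides by it stays finite for inverted elements while still assigning them a very large value, so an optimizer can move through the inverted region. Nothing in this construction forces the minimizer to have only positive volumes, consistent with the lack of a guarantee noted above.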

As  an  alternative  to  node-movement  methods  that  provide  an  invertibility  guarantee, 
there  exist  methods  which  directly  try  to  untangle  an  inverted  initial  mesh.  We  refer  to  those 
methods  as  pure  untanglers  because  they  are  specifically  designed  to  untangle  meshes  and 
generally  neglect  other  aspects  of  mesh  quality  such  as  element  Shape.  The  intended  use  of 
pure  untangle  methods  is  to  first  untangle  a  mesh  and  then  to  subsequently  apply  a  barrier 
method  to  improve  Shape  or  other  mesh  qualities. 

Freitag and Plassmann developed a pure untangler based on a system of linear programs
with the intent to maximize the minimum area/volume of elements within each local patch of
the mesh [3, 4]. When optimized, the system is guaranteed to converge, but not necessarily
to an untangled mesh. This method proves to be very effective for untangling but
counterproductive for mesh improvement, because 'a small but perfectly shaped element is likely to
be distorted in an effort to maximize its area' [10].

Shashkov et al. used a computational geometry approach to compute feasible regions in
which  a  vertex  could  be  placed  to  produce  positive  areas  [6,  7,  10].  The  feasibility  region  for 
a  given  free  vertex  is  the  space  in  which  that  vertex  can  be  shifted  such  that  all  of  its  incident 
elements  become  or  remain  uninverted.  By  constructing  halfspaces  parallel  to  the  edges  of 
each  element  adjacent  to  the  vertex  in  question,  the  feasibility  region  can  be  constructed  from 
the  intersection  of  these  halfspaces  (note  that  the  feasible  region  can  be  empty  for  a  given 
free  vertex).  The  free  vertex  is  then  placed  in  the  center  of  this  feasible  set.  The  feasible 
region  approach  to  untangling  proves  very  effective  in  removing  large  quantities  of  invalid 
elements  from  meshes,  and  is  useful  in  multi-stage  algorithm  approaches  to  untangling.  But 
disadvantages  do  exist  when  using  this  method.  First,  in  local  submeshes  where  elements 
take  extreme  shapes  or  very  small  size,  the  feasibility  region  is  naturally  very  small.  When  a 
large  cluster  of  vertices  and  elements  exist  with  minuscule  feasibility  regions  it  often  occurs 
that  passes  over  the  submesh  will  yield  null  or  oscillatory  results,  as  moving  one  vertex  to 
untangle  one  element  will  tangle  and  invert  another  element.  The  second  weakness  is  that 
the feasible regions are more difficult to calculate for elements having non-planar faces (e.g.
hexahedra). The commonly practiced solution in this event is to employ the Simplex method.
By optimizing just four linear functions, $f = x$, $f = -x$, $f = y$, $f = -y$, as opposed to one
for every edge, a subset of the feasible set can be found whenever that set is not empty [10].
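The feasible-region idea can be sketched for the simplest case, a free vertex surrounded by triangles. Each incident triangle contributes one half-plane, and the feasible region is their intersection. The code below tests membership and estimates a center by sampling; the restriction to triangular patches, the sampling-based center, and all function names are simplifying assumptions for illustration and do not reproduce the method of [6, 7, 10].

```python
# Sketch: feasible region of a free vertex in a patch of triangles.
# ASSUMPTIONS: triangular patch, neighbors given in counter-clockwise order,
# and a crude grid-sampling estimate of the region's center.

def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_feasible_region(p, ring):
    """True if every triangle (p, ring[i], ring[i+1]) has positive area."""
    n = len(ring)
    return all(cross(p, ring[i], ring[(i + 1) % n]) > 0.0 for i in range(n))


def approximate_center(ring, samples=41):
    """Average of the feasible points on a regular grid, or None if none are found."""
    xs = [q[0] for q in ring]
    ys = [q[1] for q in ring]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    feasible = []
    for i in range(samples):
        for j in range(samples):
            p = (x_min + (x_max - x_min) * i / (samples - 1),
                 y_min + (y_max - y_min) * j / (samples - 1))
            if in_feasible_region(p, ring):
                feasible.append(p)
    if not feasible:
        return None  # the feasible region may be empty (or too small to sample)
    return (sum(p[0] for p in feasible) / len(feasible),
            sum(p[1] for p in feasible) / len(feasible))


if __name__ == "__main__":
    ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
    print(in_feasible_region((2.0, 0.0), ring))   # tangled position
    print(approximate_center(ring))               # candidate untangled position
```

When neighboring free vertices each have a tiny feasible region, repeated passes of this kind of local placement can oscillate, which is the first weakness described in the text.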

Knupp gave a method of untangling that used a local non-barrier metric that penalized
inverted elements [7]. Global optimization of the objective function proved to be more robust
in terms of untangling larger and more difficult meshes, as every element is considered only
once and simultaneously via a global sweep. This 'Untangle-Beta' metric contained a global
parameter $\beta$ which was needed to ensure that elements of exactly zero area were not created.
Proper selection of the value was sometimes problematic, particularly if the area of elements
in the mesh was highly variable. The choice of the value of $\beta$ sometimes made the difference
as to whether the optimal mesh was tangled or not. Another problem with the metric was
that it was non-differentiable at certain points. R. Liska showed that this problem could be
fixed by squaring the metric, thus making the method amenable to standard optimization
methods. As with the other pure untanglers, there is no guarantee of obtaining an untangled
mesh with this method.

The focus in the present paper is on a new strategy for untangling 2D meshes via the
Target-matrix Optimization Paradigm (TMOP) [9]. TMOP is an algebraic approach rooted
in matrix analysis, allowing for the development of metrics that can improve or preserve
specific mesh qualities such as element size, shape, and orientation.
This is accomplished by establishing a collection of sample points within the elements of the
mesh, as well as a set of Target-matrices that describe the desired optimal mesh. For linear
quadrilateral mesh elements, one sample point is normally assigned per corner, and that is
the choice used in this work. Local metrics in TMOP are functions from a set of real, square
matrices $T_{d \times d}$, with $d = 2$ for 2D meshes. In turn, $T = AW^{-1}$, where $A$ is the Jacobian
matrix of the element map and $W$ the target matrix. Target matrices are constructed prior to
the optimization phase and describe the Jacobian of the map in the optimal mesh; targets are
constructed such that $\det(W) > 0$. The determinant of $A$ gives the local size at a sample point
in the active mesh and $\tau = \det(T) = \det(A)/\det(W)$ measures Sizes in the active mesh relative
to the desired optimal mesh. In order to use TMOP metrics, one must specify the targets. In
this study, targets of the form

$$W = \Lambda S_{ideal} \qquad (1.2)$$

are used. The first term in this product is a scalar and is equal to the average edge length in
the mesh. The second term in the product is a matrix which describes the ideal shape of the
element; for quadrilateral elements, $S_{ideal}$ is the Identity matrix. This particular form of the
Target-matrices says that the desired Jacobian matrices in the optimal mesh should correspond to
squares whose edge lengths are given by $\Lambda$. Although in general the set of Target-matrices can
vary from one sample point to another, in this study a constant Target suffices. If a different
target form were chosen, the results in this study might change.
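The quantities $A$, $W$, $T$, and $\tau$ can be computed directly for a linear quadrilateral with one sample point per corner. The sketch below does so for a scalar-times-identity target; the corner Jacobian built from the two edge vectors leaving each corner, the counter-clockwise vertex ordering, and all function names are assumptions made for this illustration rather than code from Mesquite.

```python
# Sketch: T = A W^{-1} and tau = det(T) at the four corners of a linear quad.
# ASSUMPTIONS: vertices are listed counter-clockwise and the corner Jacobian A
# has the two edge vectors leaving that corner as its columns.

def corner_jacobian(quad, k):
    """2x2 Jacobian (stored as rows) at corner k of a quadrilateral."""
    x0, y0 = quad[k]
    x1, y1 = quad[(k + 1) % 4]   # next vertex
    x3, y3 = quad[(k - 1) % 4]   # previous vertex
    return ((x1 - x0, x3 - x0),
            (y1 - y0, y3 - y0))


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return ((m[1][1] / d, -m[0][1] / d),
            (-m[1][0] / d, m[0][0] / d))


def matmul2(a, b):
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
                 for i in range(2))


def corner_taus(quad, size):
    """tau = det(A W^{-1}) at each corner, with target W = size * Identity."""
    w_inv = inv2(((size, 0.0), (0.0, size)))
    return [det2(matmul2(corner_jacobian(quad, k), w_inv)) for k in range(4)]


if __name__ == "__main__":
    square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
    dented = [(0.0, 0.0), (1.0, 0.0), (0.2, 0.2), (0.0, 1.0)]
    print(corner_taus(square, 2.0))   # element matches the target size
    print(corner_taus(dented, 1.0))   # one corner has negative tau
```

A square whose edge length equals the target size gives $\tau = 1$ at every corner, while pushing one vertex past the opposite diagonal produces a negative $\tau$ at that corner.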

To  clarify  the  terminology  used  in  this  paper,  we  first  define  a  mesh  to  be  tangled  or 
inverted  if  it  contains  an  inverted  element.  In  the  Finite  Element  Method,  an  inverted  element 
corresponds  to  an  element  for  which  there  is  a  point  within  the  master  element  for  which  the 
map  is  non-invertible.  Unfortunately,  it  can  be  rather  difficult  to  tell  whether  or  not  an  element 
is  inverted  based  on  this  definition,  particularly  for  linear  hexahedral  elements  and  quadratic 
or  higher-order  elements.  For  convenience,  we  substitute  for  this  definition  the  following: 
within  TMOP,  an  element  is  inverted  if  it  contains  an  invalid  sample  point.  An  invalid  sample 
point  is  a  sample  point  for  which  the  determinant  of  the  matrix  A  (or  T)  at  the  sample  point  is 
less  than  or  equal  to  zero.  An  element  containing  an  invalid  sample  point  is  always  inverted, 
but  elements  can  also  be  inverted  (according  to  the  strict  Finite  Element  definition)  without 
containing  an  invalid  sample  point.  In  practice,  this  substitute  definition  for  inverted  elements 
is  quite  useful. 

2. Establishing a Baseline. In this section we describe four tangled initial meshes and
show the results of applying either Laplace smoothing or the Untangle-Beta metric to them.
These results provide a baseline for comparison with the new untangling methods described
later. The four tangled meshes in Figure 2.1 constitute the set of 2D test meshes we use to
study the behavior of untangling and other algorithms.

All of our 2D test meshes are composed of quadrilateral elements because experience
has shown that 2D meshes with triangle elements are relatively easy to untangle. The tangled
Horseshoe was created by applying the Laplace smoother to an untangled Horseshoe mesh.
The Hole-in-Square mesh is a result of a paved mesh subjected to Laplacian smoothing.
The Inverted Hole mesh was produced by a sweeping algorithm. The Shest grid mesh was
produced by an ALE calculation. It is known a priori that, for all four of these test meshes,
untangled meshes having the same mesh topology exist.¹ More difficult test meshes than these
four exist, but the four are adequate for a preliminary investigation of our strategy.


Fig. 2.1. Four tangled test meshes; panels (c) and (d) show the Inverted Hole and the Shest Grid.

The degree of inversion or tangling varies from one initial mesh to another, from as few
as 9 inverted elements (Shest Grid) to as many as 40 (Inverted Hole). As was mentioned in
Section 1, some smoothers that improve mesh quality (e.g., Laplace smoothing) can incidentally
untangle meshes without having been designed with this goal as their primary function. For
comparison to our new algorithms, Fig. 2.2 shows how the Laplacian smoother performed
on our test meshes.² The Horseshoe and Hole-in-Square meshes are not shown because the
Laplacian smoother (run to convergence) was used to generate them and therefore applying
Laplace smoothing again will fail to untangle them.


¹ In general, given a tangled mesh, there is no way to determine a priori if an untangled mesh with the same
connectivity exists.

² All numerical results in this paper kept the boundary vertex coordinates unchanged so that only interior vertices
were moved as a result of smoothing or optimization.


Fig. 2.2. Laplacian smoothing applied to two initially tangled meshes: (a) Shest Grid, (b) Inverted Hole.


As can be seen in Fig. 2.2, the Laplacian smoother completely untangled the Shest and
Inverted Hole meshes when driven to convergence. This illustrates that some smoothers can
untangle a mesh, even though that is not their main purpose. However, the Shest grid's
original character was destroyed during the untangling, which could make the use of Laplace
smoothing inappropriate in this instance. If the Laplacian smoother is not driven to
convergence, results may differ in terms of the number of inverted elements and invalid
sample points.
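For reference, Laplace smoothing in its simplest form moves each free vertex to the average of its edge-connected neighbors and repeats until the vertices stop moving. The sketch below is a generic textbook version written for illustration, with boundary vertices held fixed as in the experiments; it is not the Mesquite implementation, and the data layout and names are assumptions.

```python
# Sketch: basic Laplace smoothing with fixed boundary vertices.
# Data layout (dicts keyed by vertex name) is an illustrative assumption.

def laplace_smooth(coords, neighbors, fixed, tol=1e-10, max_passes=10000):
    """Move each free vertex to the centroid of its neighbors until converged.

    coords:    dict vertex -> (x, y)
    neighbors: dict vertex -> list of edge-connected vertices
    fixed:     set of vertices that must not move (e.g. the boundary)
    """
    coords = dict(coords)  # work on a copy
    for _ in range(max_passes):
        max_move = 0.0
        for v, nbrs in neighbors.items():
            if v in fixed:
                continue
            new_x = sum(coords[n][0] for n in nbrs) / len(nbrs)
            new_y = sum(coords[n][1] for n in nbrs) / len(nbrs)
            max_move = max(max_move, abs(new_x - coords[v][0]), abs(new_y - coords[v][1]))
            coords[v] = (new_x, new_y)
        if max_move < tol:
            break
    return coords


if __name__ == "__main__":
    # A 2x2 block of quads whose center vertex starts outside the block (tangled).
    coords = {"s": (1.0, 0.0), "e": (2.0, 1.0), "n": (1.0, 2.0), "w": (0.0, 1.0),
              "c": (3.0, 3.0)}
    neighbors = {"c": ["s", "e", "n", "w"]}
    smoothed = laplace_smooth(coords, neighbors, fixed={"s", "e", "n", "w"})
    print(smoothed["c"])
```

Nothing in the update examines element areas, which is why the method can untangle a mesh in some cases and tangle a valid one in others.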

As an additional comparison, we apply the Untangle-Beta untangler to the four tangled
test meshes. The Untangle-Beta metric has the form:

$$\mu_\beta = \left[\, |\tau - \beta| - (\tau - \beta) \,\right]^2 \qquad (2.1)$$

with $\beta > 0$ an adjustable parameter. This metric is non-negative, with $\mu_\beta = 0$ if and only if
$\tau \ge \beta$ [6]. Unlike the Laplacian smoother, the Untangle-Beta metric was specifically designed
to untangle meshes, i.e. it is a 'pure' untangler that does not guarantee Shape or Size quality.
Figure 2.3 shows the results.³ With $\beta$ set to $\Lambda^2/20$ (approximately 1/20th of the average cell
area) for each mesh, all four meshes were untangled. As expected for pure untanglers, the
Shape quality in these meshes is poor.
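The behavior of this metric is easy to check numerically. In the sketch below the exponent is exposed as a parameter and defaults to 2, which is an assumption about the exact form used; the zero set of the metric ($\tau \ge \beta$) does not depend on that choice. The function names are hypothetical.

```python
# Sketch: Untangle-Beta style metric and its linear average over sample points.
# ASSUMPTION: the exponent (default 2) is a guess at the exact form; the property
# "zero if and only if tau >= beta" holds for any positive exponent.

def untangle_beta(tau, beta, power=2):
    """Zero when tau >= beta, positive and growing as tau drops below beta."""
    return (abs(tau - beta) - (tau - beta)) ** power


def average_metric(taus, beta, power=2):
    """Linear average of the metric over a list of sample-point tau values."""
    return sum(untangle_beta(t, beta, power) for t in taus) / len(taus)


if __name__ == "__main__":
    beta = 0.05
    for tau in (1.0, 0.05, 0.0, -0.6):
        print(tau, untangle_beta(tau, beta))
    print(average_metric([1.0, 0.2, -0.6, 0.2], beta))
```

Because the metric is identically zero once every sample point satisfies $\tau \ge \beta$, the optimizer has no incentive to improve Shape beyond that point, which is consistent with the poor Shape quality observed.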


Fig. 2.3. All four meshes after optimization with the $\mu_\beta$ metric: (a) Horseshoe, (b) Shest Grid,
(c) Hole in Square, (d) Inverted Hole.


³ The Laplace, Untangle-Beta, and newer algorithms were implemented in the Mesquite code.


               Horseshoe      Shest          Hole-in-Square   Inverted Hole
Initial        -0.178098      -0.315243      -0.957579        -1.47294
Laplace        -0.178098       0.106715      -0.957579         0.707527
$\mu_\beta$     0.00480571     0.00121174     0.0202237        0.014298

Table 2.1. $\tau_{min}$ for each mesh before and after Laplace and $\mu_\beta$.

Number of Inverted Elements

               Horseshoe     Shest       HoleSq      Inv Hole
               I     O       I    O      I     O     I     O
Laplacian      12    12      9    0      23    23    40    0
$\mu_\beta$    12    0       9    0      23    0     40    0

Table 2.2. Initial Mesh (I) and Optimized Mesh (O).

Number of Invalid Sample Points

               Horseshoe     Shest       HoleSq      Inv Hole
                I     O      I     O     I     O      I     O
Laplacian      44    44      9     0    70    70    156     0
$\mu_\beta$    44     0      9     0    70     0    156     0

Table 2.3
Initial Mesh (I) and Optimized Mesh (O).


Define $\tau_{\min}$ to be the minimum value of $\tau$ over all the sample points in a given mesh. A non-positive value of $\tau_{\min}$ signifies a tangled mesh. Entries in Table 2.1 give the minimum values for each of the four initial meshes and the results after Laplace smoothing and Untangle-Beta were applied. Because $\tau_{\min} > 0$ after optimization with $\mu_\beta$, we conclude that the Untangle-Beta metric is able to untangle all four test meshes. Notice also that when Laplacian smoothing is able to untangle a mesh, the corresponding value of $\tau_{\min}$ is larger than what one gets when using Untangle-Beta. This is because Laplacian smoothing is not a pure untangler.

Tables 2.2 and 2.3 report the number of inverted elements and the number of invalid sample points in each of the four meshes, both before and after optimization.

Except for Laplace, all numerical optimization results reported in Sections 2 and 3 used the Block Coordinate Descent method (local patches with a global objective function), a tight termination criterion of 'absolute vertex movement' less than $\Lambda/1000$, and Mesquite's Steepest Descent Solver. The Objective Function was simply the linear average of the metric over all sample points in the mesh.⁴

3. A New Strategy for Mesh Untangling. Previous untangling strategies rely on computational geometry, simplex methods, or on 'untangling' metrics. In this section we extend the latter strategy through the use of hybrid metrics which improve Size or Shape+Size, while at the same time encouraging untangling. Our strategy is particularly interesting in that the approach is based on a series of newly proved Propositions. First, we briefly discuss previously studied Size and Shape+Size metrics and how they behave with respect to untangling. Second, we prove certain Propositions about these metrics. Third, the Propositions are exploited to devise hybrid metrics that encourage untangling, as well as improve Size and Shape+Size. Fourth, numerical results for the new strategy are shown, based on the four tangled initial meshes.

3.1. Size and ShapeSize. In this section we consider two previously proposed TMOP quality metrics [8] because they are key components in our new strategy for untangling. The first metric is the so-called relative Size metric, given by

$\mu_{Sz}(T) = (\tau - 1)^2$   (3.1)

It is easy to see that $0 \le \mu_{Sz}$, with $\mu_{Sz} = 0$ if and only if $\tau = 1$. The primary purpose of this metric is thus to make the local Size in the optimal mesh as close as possible to the Size $\Lambda$ specified by the Target-matrices. Although the metric is not designed to be a pure untangle metric, it is clear that this metric would tend to untangle meshes since $\mu_{Sz} = 0$ means that $\det(A) = \det(W) > 0$.

The second metric is designed to improve element Shape and Size:

$\mu_{ShSz}(T) = |T|^2 - 2\psi(T) + 2$   (3.2)

with

$\psi(T) = \sqrt{|T|^2 + 2\tau}$   (3.3)

In previous work it was shown that $0 \le \mu_{ShSz}$, with $\mu_{ShSz} = 0$ if and only if $T = R$, where $R$ is any rotation matrix. That means the metric is invariant to orientation, but sensitive to Shape and Size. Although the metric is not designed to be a pure untangle metric, it is clear that this metric would tend to untangle meshes since $\mu_{ShSz} = 0$ means that $\det(T) = \det(R) = 1$ and thus $\det(A) > 0$.
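As an illustration of the two metrics just described, the following Python sketch evaluates the Size and ShapeSize metrics for a 2 x 2 matrix. The helper names (`det2`, `frob_sq`, `mu_size`, `mu_shape_size`) are hypothetical, $|T|$ is taken to be the Frobenius norm, and $\tau = \det(T)$; this is a sketch under those assumptions, not the Mesquite implementation.

```python
import math


def det2(T):
    """Determinant tau of a 2x2 matrix given as nested lists."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def frob_sq(T):
    """Squared Frobenius norm |T|^2."""
    return sum(x * x for row in T for x in row)


def mu_size(T):
    """Relative Size metric (tau - 1)^2."""
    return (det2(T) - 1.0) ** 2


def mu_shape_size(T):
    """ShapeSize metric |T|^2 - 2*psi + 2 with psi = sqrt(|T|^2 + 2*tau)."""
    n2 = frob_sq(T)
    # In 2D, |T|^2 + 2*tau = (a + d)^2 + (b - c)^2 >= 0; clamp roundoff only.
    psi = math.sqrt(max(n2 + 2.0 * det2(T), 0.0))
    return n2 - 2.0 * psi + 2.0


theta = 0.3
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
reflection = [[1.0, 0.0], [0.0, -1.0]]  # inverted: tau = -1

print(mu_size(rotation), mu_shape_size(rotation))      # both near zero
print(mu_size(reflection), mu_shape_size(reflection))  # both penalized
```

A rotation attains the minimum of both metrics, while a reflection (an inverted element of unit area magnitude) is penalized by both, which is the sense in which these metrics tend to untangle without guaranteeing it.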

So,  much  like  Laplace  smoothing,  both  the  Size  and  the  ShapeSize  metric,  with  the 
Target-matrix  given  in  equation  (1.2),  will  tend  to  create  untangled  meshes  when  optimized; 
they  do  not  guarantee  it,  however.  The  two  metrics  are  applied  to  the  four  tangled  test  meshes 
to illustrate the expected behavior.

4 One point on which we elaborate is the use of the Block Coordinate Descent (BCD) algorithm. An alternative provided in Mesquite is the Global Patch solver, which moves all the 'free' mesh vertices simultaneously. Unfortunately, it was found that the Global Patch did not prove as effective as BCD in terms of speed or robustness. The Shest Grid eventually converged to the same results as BCD, but took a substantially longer time to terminate with Global Patch, and the Hole-in-Square mesh always terminated, but not always to the same final mesh (suggesting that there may be tangled local minima in the global objective function). Problems also arose when Global Patch was utilized with the SizeUntangle metric because convergence to the final untangled mesh for this metric took the longest amount of computer time.

[Figure 3.1 panels: (a) Horseshoe, (b) Shest Grid, (c) Hole-in-Square, (d) Inverted Hole]

Fig. 3.1. All four meshes after optimization with the $\mu_{Sz}$ metric.


Figure 3.1 gives results for optimization with the Size metric. As this figure demonstrates, the Size metric created near-equal area elements and somewhat non-Smooth meshes; this is the expected behavior of this metric. More to the point, it was able to untangle three of the four meshes. It failed to untangle the Hole-in-Square, but reduced the number of inverted elements from 23 to 6. We conjecture that, had we used a more sophisticated Target-matrix in which the Size factor is allowed to vary with the sample point, the mesh might have untangled with the Size metric.

Figure 3.2 shows how the ShapeSize metric performed. The same three meshes were untangled when using the ShapeSize metric. The Size metric reduced the Hole-in-Square mesh to 6 inverted elements, compared to 12 for the ShapeSize metric. Comparing the results in Figure 3.2 to those in Figure 3.1, one sees the expected characteristic differences between the Size and ShapeSize metrics. The ShapeSize results also bear a strong resemblance to the Laplacian results for the Shest Grid and the Inverted Hole meshes.

Now  that  we  have  some  data  on  the  effects  of  some  basic  TMOP  metrics  on  tangled 
meshes,  we  give  some  Propositions  that  will  allow  us  to  develop  new  untangling  metrics. 

3.2. Propositions. In this section it will be shown that certain local metrics $\mu$ within TMOP enjoy the following property: for any $d \times d$ real matrix $T$, if $\mu(T) < 1$, then $\tau > 0$.

Proposition 3.2.1. Let $T$ be a $d \times d$ matrix. If $\mu(T) = (\tau - 1)^2 < 1$, then $0 < \tau < 2$.

Proof. Suppose the metric is less than one. Then

$(\tau - 1)^2 < 1$

$\tau^2 - 2\tau + 1 < 1$

$\tau^2 - 2\tau < 0$

$\tau(\tau - 2) < 0$

So either: A) $\tau < 0$ and $\tau - 2 > 0$, or B) $\tau > 0$ and $\tau - 2 < 0$.

Case A is impossible, leaving B: $0 < \tau < 2$. □

[Figure 3.2 panels: (a) Horseshoe, (b) Shest Grid, (c) Hole-in-Square, (d) Inverted Hole]

Fig. 3.2. All four meshes after optimization with the $\mu_{ShSz}$ metric.

Thus, when the Size metric is less than 1, the determinant is positive. On the other hand, having the determinant positive does not necessarily mean that the Size metric will be less than 1.

Here is another metric which enjoys a similar property:

Proposition 3.2.2. Let $T$ be a $d \times d$ real matrix. If $\mu(T) = |T|^2 - 2\psi(T) + 2 < 1$, then $\tau > 0$.

Proof. Suppose the metric is less than one. Then

$|T|^2 - 2\psi + 2 < 1$

$|T|^2 - 2\psi + 1 < 0$

$|T|^2 + 1 < 2\psi$

But $(|T| - 1)^2 \ge 0$ leads to $2|T| \le |T|^2 + 1$. Combining these results gives

$2|T| \le |T|^2 + 1 < 2\psi$

and thus $|T| < \psi$. Therefore

$|T| < \psi = \sqrt{|T|^2 + 2\tau}$

$|T|^2 < |T|^2 + 2\tau$

and so we have $0 < \tau$. □
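The two Propositions can be spot-checked numerically. The Python sketch below samples random 2 x 2 matrices and counts any case in which a metric value below 1 coincides with a non-positive determinant. All names (`tau`, `mu_size`, `mu_shape_size`, `count_violations`) and the sampling range are illustrative assumptions; a random check of this kind is only a sanity test and is no substitute for the proofs.

```python
import math
import random


def tau(T):
    """Determinant of a 2x2 matrix given as nested lists."""
    return T[0][0] * T[1][1] - T[0][1] * T[1][0]


def mu_size(T):
    """Size metric (tau - 1)^2."""
    return (tau(T) - 1.0) ** 2


def mu_shape_size(T):
    """ShapeSize metric |T|^2 - 2*psi + 2, psi = sqrt(|T|^2 + 2*tau)."""
    norm_sq = sum(x * x for row in T for x in row)
    psi = math.sqrt(max(norm_sq + 2.0 * tau(T), 0.0))
    return norm_sq - 2.0 * psi + 2.0


def count_violations(trials=20000, seed=0):
    """Return (hits, violations) over random matrices with entries in [-2, 2].

    A hit is a sample with metric < 1; a violation is a hit whose
    determinant falls outside the range the Propositions predict.
    """
    rng = random.Random(seed)
    hits = violations = 0
    for _ in range(trials):
        T = [[rng.uniform(-2.0, 2.0) for _ in range(2)] for _ in range(2)]
        t = tau(T)
        if mu_size(T) < 1.0:          # Proposition 3.2.1: expect 0 < tau < 2
            hits += 1
            if not (0.0 < t < 2.0):
                violations += 1
        if mu_shape_size(T) < 1.0:    # Proposition 3.2.2: expect tau > 0
            hits += 1
            if not t > 0.0:
                violations += 1
    return hits, violations


hits, violations = count_violations()
print(hits, violations)
```

The same helpers show that the implication does not reverse: for $T = 2I$ the determinant is positive ($\tau = 4$), yet the ShapeSize metric evaluates to 2, which is not below 1.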

Although Proposition 3.2.2 is proved for $d \times d$ matrices, the metric itself is not a Shape metric unless $d = 2$, and thus the Proposition is only useful for 2D meshes. For 3D meshes, the Shape metric has a different form [5].

The converse to this Proposition is only true if $T$ is such that $\psi = 1$. To see this, suppose $0 < \tau$. Then

$0 < 2\tau$

$|T|^2 < |T|^2 + 2\tau$

$|T|^2 < \psi^2$

$|T|^2 - 2\psi + 2 < \psi^2 - 2\psi + 2$

$\mu(T) < 1 + (\psi - 1)^2$   (3.4)

As a final result, we have the following Proposition for the 2D non-barrier Shape metric $\mu_{Sh}(T) = |T|^2 - 2\tau$:

Proposition 3.2.3. Let $T$ be a $2 \times 2$ matrix. If $\mu_{Sh}(T) = 0$, then $\tau \ge 0$.

Proof. In [8] it was shown that $\mu_{Sh} = 0$ if and only if $T$ is a scaled rotation, i.e., $T$ has the form $T = sR$. In that case, $\tau = \det(sR) = s^2 \ge 0$. □

3.3. Exploiting the Propositions. Both the Size and ShapeSize metrics have the potential to untangle meshes because their target matrices are non-inverted. Based on the Propositions of the previous section, these metrics can further be used to devise new metrics that encourage untangling even more. Let $\mu$ be a TMOP metric such that $\mu < 1$ guarantees that $0 < \tau$. Then consider the following function $f(T) = f(\mu(T))$, which can be considered to be a TMOP untangle metric:

$f(\mu) = \left\{\, |[1 - \epsilon] - \mu| - ([1 - \epsilon] - \mu) \,\right\}^2$   (3.5)

with $0 < \epsilon$ a number that is small compared to 1. It is easy to see that $f(T) \ge 0$. Further, $f(T) = 0$ if and only if $\mu \le 1 - \epsilon < 1$. Therefore, if $f = 0$, then $\tau > 0$ (note the similarity with the Untangle-Beta metric (2.1)).

Propositions 3.2.1 and 3.2.2 thus allow us to create two new metrics having an increased potential to untangle meshes:

$f_{SzU} = \left\{\, |[1 - \epsilon] - \mu_{Sz}| - ([1 - \epsilon] - \mu_{Sz}) \,\right\}^2$   (3.6)

$f_{ShSzU} = \left\{\, |[1 - \epsilon] - \mu_{ShSz}| - ([1 - \epsilon] - \mu_{ShSz}) \,\right\}^2$   (3.7)

Note that $f_{SzU} = 0$ gives $\mu_{Sz} < 1$ and, therefore, $0 < \tau < 2$. Thus the metric tends both to untangle and to encourage $\tau = 1$. In contrast, the Untangle-Beta metric only tends to untangle. Similarly, the second metric should tend to untangle and to encourage well-shaped elements. Proposition 3.2.3 shows that the use of the Shape metric $\mu_{Sh}$ in $f(\mu)$ would be unproductive.
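The construction of the hybrid metrics can be sketched in a few lines of Python. The wrapper below applies the untangle form to an arbitrary metric value, and the second function specializes it to the Size metric. The names `untangle_wrap` and `f_size_untangle` are hypothetical, and the default of 0.01 for epsilon is simply the value used in the numerical experiments of this paper.

```python
def untangle_wrap(mu, eps=0.01):
    """Hybrid untangle function f(mu) = (|(1 - eps) - mu| - ((1 - eps) - mu))**2.

    mu  : value of a TMOP metric for which mu < 1 implies tau > 0
    eps : small positive number compared to 1
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    d = (1.0 - eps) - mu
    # Zero whenever mu <= 1 - eps; positive (quadratic in mu) otherwise.
    return (abs(d) - d) ** 2


def f_size_untangle(tau, eps=0.01):
    """SizeUntangle sketch: wrap the Size metric (tau - 1)^2."""
    return untangle_wrap((tau - 1.0) ** 2, eps)


# tau = 1 (target Size) and tau = 0.5 both give f = 0, so they are accepted;
# an inverted sample point with tau = -0.5 is penalized.
for t in (1.0, 0.5, -0.5):
    print(t, f_size_untangle(t))
```

One consequence visible in the sketch is that the hybrid function is flat (zero) over the whole region where the wrapped metric is at most $1 - \epsilon$, so it stops driving Size toward the target once every sample point is safely valid.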

3.4. Numerical Experiments with the New Untangle Metrics. First, the SizeUntangle quality metric ($f_{SzU}$) was applied to the four tangled test meshes, with a constant value of $\epsilon = 0.01$. The optimized meshes are shown in Figure 3.3. Not surprisingly, the results are similar to the results obtained by the Size metric shown in Figure 3.1. Results for the Hole-in-Square differ the most, with the SizeUntangle results looking less close to being inverted. In fact, the SizeUntangle metric was able to untangle all four meshes.


[Figure 3.3 panels: (a) Horseshoe, (b) Shest Grid, (c) Hole-in-Square, (d) Inverted Hole]

Fig. 3.3. All four meshes after optimization with $f_{SzU}$.

[Figure 3.4 panels: (a) Horseshoe, (b) Shest Grid, (c) Hole-in-Square, (d) Inverted Hole]

Fig. 3.4. All four meshes after optimization with $f_{ShSzU}$.


In Figure 3.4, the optimal meshes resulting from the ShapeSizeUntangle metric ($f_{ShSzU}$) are shown. These meshes look very similar to those obtained by applying the $\mu_{ShSz}$ metric (Figure 3.2). The one noticeable difference, as the following data will further detail, is that our new metric reduced the Hole-in-Square mesh to 4 of its original 23 inverted elements, while the ShapeSize metric only obtained a result of 12. Overall, due to the untangling of the Hole-in-Square mesh, the SizeUntangle metric performed better, as an untangler, than the ShapeSizeUntangle metric. As was noted in Section 3.1, Size was able to reduce Hole-in-Square to 6 inverted elements, while ShapeSize only managed 12. Perhaps this provides insight as to why the SizeUntangle metric was able to untangle the mesh when the ShapeSizeUntangle metric could not.

Several items to note from Table 3.1: The Shape metric performed the same, in terms of untangling, as Laplace smoothing. Comparing the results of the two $\mu$ metrics vs. their $f(\mu)$ counterparts, we see that the latter did prove somewhat more effective in untangling meshes. The original $\mu_\beta$ metric remains highly competitive. Better shaped elements are produced by the ShapeSize and ShapeSizeUntangle metrics when compared to the Size and SizeUntangle metrics, although the latter are perhaps superior in terms of ability to untangle.

For completeness we also optimized the four meshes using $f_{ShUntangle}$, even though this is not supported by a Proposition similar to 3.2.1 and 3.2.2. Results were the same as for $\mu_{Sh}$ in terms of ability to untangle.

Table 3.1
Summary of Whether a Given Method Was Able to Untangle a Given Mesh.
T means 'not able to untangle'; U means 'able to untangle'.

                Horseshoe   Shest   Hole-in-Sq   Inv. Hole   #T
Initial Mesh        T         T         T            T        4
Laplace             T         U         T            U        2
$\mu_\beta$         U         U         U            U        0
$\mu_{Sz}$          U         U         T            U        1
$f_{SzU}$           U         U         U            U        0
$\mu_{ShSz}$        U         U         T            U        1
$f_{ShSzU}$         U         U         T            U        1
$\mu_{Sh}$          T         U         T            U        2
#T                  2         0         5            0

Table 3.2 gives the values of $\tau_{\min}$ resulting from the various optimizations. As one can see, the Hole-in-Square has the smallest $\tau_{\min}$ values, and shows the relative success of the various methods in untangling it.


Table 3.2
Minimum $\tau$ value for each mesh before and after optimization with the TMOP metrics.

                Horseshoe    Shest        Hole-in-Square   Inverted Hole
Initial         -0.178098    -0.315243    -0.957579        -1.47294
$\mu_{Sz}$       0.804821     0.253687    -0.188222         0.672569
$f_{SzU}$        0.790504     0.270406     0.0570835        0.717553
$\mu_{ShSz}$     0.354348     0.192078    -0.901212         0.736087
$f_{ShSzU}$      0.295828     0.169121    -0.312364         0.735972
$\mu_{Sh}$      -0.178098     0.193576    -0.957579         0.735989

Table 3.3
Initial Mesh (I) and Optimized Mesh (O).

Number of Inverted Elements

                     Horseshoe     Shest       HoleSq      Inv Hole
                      I     O      I     O     I     O      I     O
Size                 12     0      9     0    23     6     40     0
SizeUntangle         12     0      9     0    23     0     40     0
ShapeSize            12     0      9     0    23    12     40     0
ShapeSizeUntangle    12     0      9     0    23     4     40     0
Shape                12    12      9     0    23    23     40     0

Table 3.4
Initial Mesh (I) and Optimized Mesh (O).

Number of Invalid Points

                     Horseshoe     Shest       HoleSq      Inv Hole
                      I     O      I     O     I     O      I     O
Size                 44     0      9     0    70     6    156     0
SizeUntangle         44     0      9     0    70     0    156     0
ShapeSize            44     0      9     0    70    31    156     0
ShapeSizeUntangle    44     0      9     0    70     4    156     0
Shape                44    44      9     0    70    70    156     0

4. Conclusions. A new mesh optimization strategy for untangling 2D meshes, based on node-movement, has been investigated. The strategy relies on a set of Propositions which show that, for certain quality metrics $0 \le \mu < \infty$ within the Target-matrix paradigm, if $\mu < 1$ then the local area is positive. The Propositions were exploited in devising a new strategy for mesh untangling that can also incorporate other goals such as Size or ShapeSize improvement (i.e., the new metrics are not pure untanglers). In this way, the new strategy is similar in purpose (but not in approach) to the method in [2]. Numerical results showed that the new metrics (especially SizeUntangle) performed reasonably well when compared to the authors' previous Untangle-Beta metric (additional tangled meshes are needed to confirm this conclusion). Significantly, the new metrics do not require that one specify a parameter similar to the awkward $\beta$ parameter.⁵ It is unfortunate that a useful Proposition for the Shape metric does not exist since, if it did, one could avoid having to construct proper values for $\Lambda$ in the Target-matrices, because the Shape metric is Size-invariant. Future work within this new strategy might include more sophisticated models for the Size parameter $\Lambda$ in the Target-matrix to allow untangling of meshes having heterogeneously sized elements. Work on extending this approach to 3D meshes has already commenced. Finally, like all the other untangling algorithms appearing in the literature, the new metrics do not guarantee an untangled result.
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those  given  in  Section  3.2.  This  metric  was  not  used  in  this  study  because  it  requires  one 
to  include  Orientation  information  within  the  Target-matrix  construction  step,  and  it  is  not 
obvious  what  that  information  should  be  in  mesh  untangling. 

REFERENCES 

[1] R. Barrera-Sanchez and J. Tinoco-Ruiz, Smooth and Complex Grid Generation over General Plane Regions, Mathematics and Computers in Simulation, 46 (1998), pp. 87-102.
[2] J. Escobar, E. Rodriguez, R. Montenegro, G. Montero, and J. Gonzalez-Yuste, Simultaneous Untangling and Smoothing of Tetrahedral Meshes, Computer Methods in Applied Mechanics and Engineering, 192 (2003), pp. 2775-2787.
[3] L. Freitag and P. Plassmann, Local Optimization-Based Simplicial Mesh Untangling and Improvement, Intl. Journal of Numerical Methods in Engineering, 49, pp. 109-125.
[4] ---, Local Optimization-Based Untangling Algorithms for Quadrilateral Meshes, (2001).
[5] P. Knupp, Local 3d metrics for mesh optimization in the target-matrix paradigm, manuscript.
[6] ---, Hexahedral Mesh Untangling and Algebraic Mesh Quality Metrics, 9th International Meshing Roundtable, (2000), pp. 173-183.
[7] ---, Hexahedral and Tetrahedral Mesh Untangling, Engineering with Computers, 17 (2001), pp. 261-268.
[8] ---, Local 2d Metrics for Mesh Optimization in the Target-matrix Paradigm, SAND2006-7382J, Sandia National Laboratories, (2006).
[9] ---, Introducing the target-matrix paradigm for mesh optimization via node-movement, Proceedings of the 19th International Meshing Roundtable, (2010).
[10] P. Vachal, R. Y. Garimella, and M. J. Shashkov, Mesh Untangling, Los Alamos National Laboratory Summer report, (2002).
[11] A. Winslow, Numerical solution of the quasilinear Poisson equations in a nonuniform triangle mesh, J. Comp. Phys., 2 (1967), pp. 149-172.


5 The parameter $\epsilon$ in $f(\mu)$ is easier to choose than $\beta$ since it only needs to be small compared to 1, whereas $\beta$ must be small compared to $\tau$. Moreover, $\epsilon$ can be constant over the mesh even if the mesh is Size-heterogeneous.


CSRI  Summer  Proceedings  2010 




STATIC VERTEX REORDERING SCHEMES FOR LOCAL MESH QUALITY IMPROVEMENT

JEONGHYUNG PARK, PATRICK KNUPP, AND SUZANNE M. SHONTZ

Abstract. Numerical experiments were conducted to investigate how timings in a local mesh Laplacian-smoothing algorithm vary as a function of the ordering of the vertices (and patches) in the mesh. Timings varied by roughly a factor of 2, depending on which of the 20 orderings was used. Sensitivity of these results as a function of mesh size, mesh quality, and termination criteria was investigated. No particular ordering was best in all situations, although certain orderings appear to be viable candidates for an all-purpose ordering.

1. Introduction. The mesh and its quality can greatly impact: (1) the accuracy of the PDE solution and (2) the conditioning, stability, and efficiency of the associated PDE solver [11, 7]. The quality of the mesh can be improved by adaptivity [1, 6], smoothing [8, 2], or swapping [4, 5]. Mathematically rigorous mesh smoothing is performed via optimization of an objective function which measures the mesh quality. In mesh smoothing, vertex movement strategies are applied in order to change the mesh vertex coordinates, while maintaining the connectivity of the initial mesh. In this paper, we focus on static vertex reordering schemes within the context of local mesh optimization in order to improve mesh quality.

1.1.  Related  Literature.  Shontz  and  Knupp  previously  performed  research  on  vertex 
reordering  schemes  within  the  context  of  local  mesh  optimization  [12].  Key  findings  from 
their  paper  [12]  are  as  follows: 

1.  Vertex  reorderings  were  most  helpful  when  the  initial  mesh  is  far  from  optimal. 

2.  Vertex  reorderings  were  most  helpful  when  the  initial  vertex  ordering  is  poor. 

3. Dynamic vertex reordering schemes were too expensive relative to static vertex reordering schemes.

These  results  form  the  motivation  for  the  current  study  on  static  vertex  reordering  schemes 
within  the  context  of  local  mesh  optimization. 

Other researchers have also investigated vertex reordering within the context of mesh smoothing performed via optimization. However, it should be noted that the papers discussed below develop results on vertex reordering schemes applied to improving the efficiency of the mesh optimization via improvements in the cache performance on a specific computer architecture. In contrast, we are developing general-purpose techniques, i.e., schemes which are independent of the computer architecture.

For example, Munson [9] and Munson and Hovland [10] developed a Feasible Newton mesh optimization algorithm and benchmark. Their algorithm and benchmark employed both data and iteration reordering in order to improve cache performance. One finding of their research was that reordering of the input data can increase or decrease the number of iterations taken by the inexact Newton method and can affect its success or failure [9]. The reordering applied was a reordering of the vertices and elements in the mesh by applying a breadth-first search and reversing the order in which the vertices were visited. When data and iteration ordering were performed on the relevant hypergraphs, the reorderings were found to significantly decrease the number of cache misses in all phases of code execution and resulted in significantly faster code [10].

Strout et al. [13] investigated six data and iteration reordering schemes and applied them to tetrahedral meshes before they were optimized using FeasNewt. With each reordering scheme, the vertices and elements of the mesh were reordered using a hypergraph model. The six orderings considered in their paper were: (1) the null ordering, (2) Hyper-BFS (a breadth-first search on the hypergraph), (3) Hyper-CPack (consecutive packing on the hypergraph), (4) Hyper-Part (hypergraph partitioning), (5) HierBFS (Hierarchical BFS), and (6) HierCPack (Hierarchical consecutive packing). They saw up to 40% better performance than the original ordering. Their results show that hierarchical data reordering improves over the performance of local reordering strategies. Performance metrics of interest were overall execution time as well as various types of cache misses (L1, L2, and TLB). This paper lists two possibilities for future research: (1) showing that a particular ordering will never slow down a particular algorithm if applied and (2) automatic determination of the best ordering.

1.2.  Vertex  Reordering  Study.  For  the  current  study,  we  introduce  vertex  reordering 
of  the  interior  vertices  within  the  context  of  the  local  Laplacian  smoothing  procedure.  Many 
other  methods  for  mesh  smoothing  exist,  but  Laplacian  was  selected  for  this  study  because  it 
is  simple  and  well-understood.  Two  classes  of  reordering  techniques  were  suggested  in  [12]: 
static  vertex  reordering  and  dynamic  vertex  reordering.  In  the  static  case,  the  initial  ordering 
of  the  vertex  list  provided  to  the  mesh  optimization  algorithm  is  reordered  by  using  some 
criteria.  The  reordered  list  is  used  and  kept  fixed  during  the  optimization.  In  the  dynamic 
case,  the  reordered  list  is  updated  at  the  end  of  each  iteration  within  the  optimization,  thus 
creating  a  sequence  of  vertex  lists.  In  this  paper,  we  focus  on  static  reordering  of  the  vertices. 
The  quality  of  the  mesh  is  measured  according  to  a  normalized  Laplacian  mesh  quality  metric. 
The  principal  goal  of  this  study  is  to  determine  the  key  variables  that  affect  the  success  of 
the  reordering  methods  as  determined  by  the  performance  of  the  underlying  Laplacian  mesh 
smoothing  routine. 

We  specify  the  pseudocode  for  the  local  mesh  optimization  procedure  which  identifies 
when  vertex  reordering  is  performed  and  also  the  details  of  the  timing.  Further  details  on  the 
timing  metrics  of  interest  will  be  described  in  the  Numerical  Experiments  section. 

1. Initialization: vertex list, tolerance values, termination criterion,
   objective function, control parameters.
   -- CPU Timer_1
2. While tolerance not satisfied and iteration count not exceeded
   -- CPU Timer_2
      Loop over list of free vertices
         - update free vertex coordinates (via optimization)
      End loop over free vertices
   -- CPU Timer_3
      Reorder vertex list on first iteration
   -- CPU Timer_4
   End while loop
   -- CPU Timer_5
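The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Mesquite implementation: the function name and the callbacks `update_vertex`, `objective`, and `reorder` are hypothetical, convergence is judged by the change in the objective value, and `time.process_time` stands in for the CPU timers.

```python
import time


def smooth_with_static_reordering(vertices, update_vertex, objective, reorder,
                                  tol=1e-6, max_iters=100):
    """Local smoothing loop with one static reordering after the first pass.

    vertices      : initial list of free-vertex ids (hypothetical representation)
    update_vertex : callback that relocates one free vertex in place
    objective     : callback returning the current global objective value
    reorder       : callback mapping the current vertex list to a new list
    """
    order = list(vertices)
    timings = {"loop": 0.0, "reorder": 0.0}
    t_start = time.process_time()              # Timer_1
    prev = objective()
    iters = 0
    while iters < max_iters:
        t_loop = time.process_time()           # Timer_2
        for v in order:
            update_vertex(v)
        t_after = time.process_time()          # Timer_3
        timings["loop"] += t_after - t_loop
        if iters == 0:
            # Static scheme: reorder once, then keep the list fixed.
            order = reorder(order)
            timings["reorder"] += time.process_time() - t_after  # Timer_4
        iters += 1
        cur = objective()
        if abs(prev - cur) <= tol:
            break
        prev = cur
    timings["total"] = time.process_time() - t_start  # Timer_5
    return order, iters, timings
```

Reordering after the first sweep, rather than before it, lets movement-based schemes use the displacement produced by that sweep, and keeping the reorder time in a separate bucket makes its overhead visible next to the smoothing time.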

1.3. Vertex Reordering Study Design. For our initial study, we fixed the following variables: quality metric, optimization template (linear averaging), optimization solver (Laplace smoothing), sorter (Quicksort), mesh connectivity, serial computation, and architecture. We varied the mesh size, element type, element heterogeneity, element isotropy, solver convergence tolerance, and vertex reordering scheme. Mesquite, the Mesh Quality Improvement Toolkit [3], was used for the study.
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2.  Mesh  Quality  Improvement  Problem.  Let  q  denote  the  objective  function  as  com¬ 
puted  by  the  Laplacian  metric.  Then  q  =  ^  where  lt  is  the  length  of  the  ith  interior 

edge and $n$ is the total number of interior edges in the mesh. Define the normalized Laplace objective function by

$q_N = \dfrac{q_{final} - q_{cur}}{q_{final} - q_{init}}$,

where $q_{init}$ is the initial value of the objective function, $q_{cur}$ is the current value of the objective function during the optimization, and $q_{final}$ is the value of the objective function when the optimization is converged. The range of the normalized objective function is from 0 to 1, with 0 being the optimal value. Minimizing $q_N$ is equivalent to minimizing $q_{cur}$. The value of $q_{final}$, which establishes convergence, was based on observing when $q_{cur}$ became nearly constant as the optimization progressed.
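As a small worked illustration of the normalization, the Python sketch below assumes the normalized objective is the ratio (q_final - q_cur) / (q_final - q_init), which matches the stated range of 0 to 1 with 0 optimal. The function name and the sample values are hypothetical.

```python
def normalized_objective(q_cur, q_init, q_final):
    """Normalized objective: 1 at the initial mesh, 0 at convergence.

    Assumes q_N = (q_final - q_cur) / (q_final - q_init).
    """
    denom = q_final - q_init
    if denom == 0.0:
        raise ValueError("q_init equals q_final; normalization is undefined")
    return (q_final - q_cur) / denom


# Hypothetical run: objective falls from 5.0 toward a converged value of 2.0.
for q in (5.0, 3.5, 2.0):
    print(q, normalized_objective(q, q_init=5.0, q_final=2.0))
```

Because the normalization is an affine function of q_cur with positive slope whenever q_init exceeds q_final, ranking meshes by the normalized value is the same as ranking them by the raw objective.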

3. Vertex Reordering Schemes. We apply vertex reordering in the above mesh optimization algorithm. Most of the schemes reorder the vertices at the beginning of the optimization process. However, in the case of the vertex movement-related schemes, the vertex reordering is executed after one optimization step is complete so that the distance between the previous and the current vertex coordinates can be computed. In this paper, static vertex reordering is considered, i.e., reordering is performed only once. The twenty vertex reordering schemes explored are as follows (taken from [12]):

•  Scheme  (N):  Null  ordering.  Do  not  reorder  the  vertex  list. 

•  Scheme  (R):  Random  Ordering.  Generate  a  random  integer  from  1  to  N,  with  N 
being  the  total  number  of  local  patches  in  the  mesh.  Place  the  vertex  assigned  the 
value  1  first  in  the  list  and  N  last. 

•  Scheme  (WQP):  Ordering  by  Worst  Quality  Patch.  Evaluate  objective  function 
on  a  local  patch  to  measure  local  patch  quality.  Sort  by  putting  the  worst  quality 
patch  first  in  the  list,  and  so  on. 

•  Scheme  (GAVM):  Ordering  by  Greatest  Absolute  Vertex  Movement.  Evaluate 
the  absolute  distance  moved  by  the  free  vertex  in  the  local  patch  before  and  after 
the  local  optimization.  Sort  by  putting  the  patch  with  the  greatest  absolute  distance 
moved  first  in  the  list,  and  so  on. 

•  Scheme  (GRVM):  Ordering  by  Greatest  Relative  Vertex  Movement.  Evaluate 
the  relative  distance  moved  by  the  free  vertex  in  the  local  patch  before  and  after 
the  local  optimization.  Relative  distance  was  measured  by  dividing  the  absolute 
distance-moved  by  the  initial  absolute  distance-moved;  the  normalizing  quantity 
thus  varies  from  patch  to  patch.  Sort  by  putting  the  patch  with  the  greatest  rela¬ 
tive  distance  moved  first  in  the  list,  and  so  on. 

•  Scheme  (LNG):  Ordering  by  Largest  Norm  of  Gradient.  Evaluate  the  £2-norm 
of  the  local  gradient  of  the  objective  function.  Sort  by  putting  the  largest  norm  first 
in  the  list,  and  so  on. 

• Schemes (D-WQP, D-GAVM, D-GRVM, and D-LNG): Distance schemes. Created
by measuring the distance of the free vertex in each patch from the position
of the free vertex which had the (i) worst patch quality, (ii) greatest absolute vertex
movement, (iii) greatest relative vertex movement, or (iv) largest norm of the gradient,
respectively. Then sort by putting the patch with the smallest distance first in
the list, and so on.

• Schemes (-N, -R, -WQP, -D-WQP, -GAVM, -D-GAVM, -GRVM, -D-GRVM,
-LNG, -D-LNG): Reverse schemes. For each of the schemes above there is a corresponding
ordering scheme produced by reversing the ordering. For example, scheme
-WQP orders the vertex patches from best quality first to worst quality last.
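
The schemes listed above all reduce to sorting patch indices by some per-patch value, optionally measuring distance from a seed vertex, and optionally reversing. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the sample `quality` values (where a larger objective value is assumed to mean worse quality), and the sample `positions` are all hypothetical.

```python
import math
import random

def order_by_key(values, descending=True):
    """Indices of patches sorted by a per-patch value (quality, movement, gradient norm)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)

def distance_order(positions, seed_index):
    """Indices sorted by distance of each free vertex from the seed vertex, nearest first."""
    seed = positions[seed_index]
    return sorted(range(len(positions)), key=lambda i: math.dist(positions[i], seed))

def reverse_order(order):
    """Reverse scheme: the same ordering, traversed from last to first."""
    return list(reversed(order))

def random_order(n, rng):
    """Scheme R: a random permutation of the n patch indices."""
    order = list(range(n))
    rng.shuffle(order)
    return order

# Hypothetical per-patch objective values (assumed: larger means worse quality)
quality = [1.2, 3.5, 2.0, 0.9]
# Hypothetical free-vertex positions, one per patch
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (1.5, 0.0, 0.0)]

wqp = order_by_key(quality)                # worst quality patch first
d_wqp = distance_order(positions, wqp[0])  # nearest to the worst patch first
minus_d_wqp = reverse_order(d_wqp)         # reverse of the distance scheme
r = random_order(len(quality), random.Random(0))
```

The same `order_by_key` call would serve GAVM, GRVM, and LNG once the corresponding per-patch values are available; only the input values change.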

4. Numerical Experiments. To determine the efficiency of the static vertex reordering
schemes, four numerical experiments were designed. Each experiment was designed to illustrate
the impact of one of the following factors on the efficiency of the mesh optimization algorithm
when used in conjunction with the various vertex reordering schemes: the final mesh quality
desired, the mesh size, the initial mesh quality, and the initial vertex ordering. The following
subsections explain these experiments and the associated results in more detail.

A  tetrahedral  hook  geometry  was  used  to  create  the  initial  meshes  for  the  experiments 
(Figure  4.1). 


Fig.  4.1.  The  tetrahedral  hook  geometry  used  to  create  the  initial  meshes  for  the  experiments 


Our mesh quality optimization procedure minimized the objective function q, which corresponds
to the overall mesh quality. To evaluate the normalized metric q^LM, timing results
are reported for when the mesh optimization algorithm reached q_final = 2.237975.

The timing results for each scheme were unstable due to the background tasks of the machine.
Multiple runs were required in order to obtain stable relative rankings of the reordering
schemes. Because the average time results after five runs were close to stable, five runs of
each scheme for each experiment were performed, and their times were averaged for each
scheme. The average total time (over the 5 runs) of a given scheme is defined as TT and all
timing results for each experiment used the TT values of each reordering scheme.

To characterize the variation in total times over the 20 schemes, compared to the total
time of a typical scheme, results are reported as
$\frac{TT_{max}+TT_{min}}{2} \pm \frac{TT_{max}-TT_{min}}{2}$, where $TT_{min}$ and
$TT_{max}$ are the minimum and maximum total times of all the reordering schemes. Also, the
percent variation in the total time was computed by dividing $\frac{TT_{max}-TT_{min}}{2}$
by $\frac{TT_{max}+TT_{min}}{2}$.
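
As a small illustration of this reporting convention, the sketch below computes the midpoint, the half-range, and the percent variation from a list of total times. The function name is hypothetical. The extremes 28.50 and 115.64 are back-computed from the 72.07±43.57 seconds reported for NLM=0.75 in Experiment 1, and the middle value 60.00 is an arbitrary filler.

```python
def summarize_times(total_times):
    """Return (midpoint, half_range, percent_variation) for a list of TT values."""
    tt_min, tt_max = min(total_times), max(total_times)
    midpoint = (tt_max + tt_min) / 2      # "typical" total time
    half_range = (tt_max - tt_min) / 2    # spread around the midpoint
    percent_variation = 100.0 * half_range / midpoint
    return midpoint, half_range, percent_variation

# Extremes implied by 72.07 +/- 43.57 seconds; 60.00 is a made-up intermediate value
midpoint, half_range, percent_variation = summarize_times([28.50, 60.00, 115.64])
print(f"{midpoint:.2f} +/- {half_range:.2f} seconds ({percent_variation:.2f}% variation)")
```

Only the fastest and slowest schemes enter these statistics, so a single outlier scheme can dominate the reported variation.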

To effectively compare the timing results of each scheme, we defined $\beta$ as follows:

$$\beta = \frac{TT - TT_{avg}}{TT_{avg}}, \tag{4.1}$$

where $TT_{avg}$ is the average total time of all the reordering schemes. Note that if $\beta > 0$ for a
particular scheme, the scheme is slower than average. Similarly, if $\beta < 0$, the scheme is better
than average. The deviation in $\beta$ used for each experiment represented the value farthest from
$\beta = 0$.
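
The normalized measure in (4.1) can be sketched in a few lines. The scheme names below are taken from the text, but the total times are hypothetical values chosen only to make the arithmetic easy to follow; the function names are likewise assumptions.

```python
def beta_values(total_times):
    """Map scheme name -> beta = (TT - TT_avg) / TT_avg, following (4.1)."""
    tt_avg = sum(total_times.values()) / len(total_times)
    return {name: (tt - tt_avg) / tt_avg for name, tt in total_times.items()}

def beta_deviation(betas):
    """The beta value farthest from 0, by magnitude."""
    return max(betas.values(), key=abs)

# Hypothetical total times in seconds (not measured data)
times = {"N": 300.0, "R": 150.0, "WQP": 100.0, "-D-WQP": 50.0}
betas = beta_values(times)
deviation = beta_deviation(betas)
```

By construction the $\beta$ values of all schemes sum to zero, so a single very slow scheme pushes the others below zero.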

All four experiments were performed with a multicore machine equipped with two processors:
a 4 quad-core AMD Shanghai processor running at 2.7 GHz with 96 GB of RAM
and a 2 quad-core Intel Nehalem processor running at 2.66 GHz with 24 GB of RAM. The
timing of each run was very sensitive to the environment because the machine was multicore
and there was memory contention between processors. Although five runs of each scheme for each
experiment were performed to obtain stable timing results, the timing results might be different
if the experimental environment is changed.

4.1. Experiment 1: Sensitivity to Outer Termination Criterion Tightness. The purpose
of this experiment was to observe the behavior of the mesh optimization algorithm when used
in conjunction with the various reordering schemes when different outer termination values
were set. The mesh optimization algorithm was terminated based on the normalized Laplacian
metric (NLM). In this experiment, four NLM values were considered, i.e., 0.75, 0.5,
0.25,  and  0,  as  termination  values.  The  first  three  cases  were  used  to  observe  the  behavior  of 
each  scheme  when  a  relatively  inaccurate  solution  is  pursued,  whereas  the  last  case  was  used 
to  observe  the  behavior  when  an  accurate  solution  is  desired.  After  the  mesh  quality  reached 
the  desired  NLM  value,  the  mesh  optimization  algorithm  terminated,  and  the  corresponding 
timing  information  was  recorded. 


[Figure: ranking vs. time (sec) panels for (a) NLM=0.75, (b) NLM=0.5, (c) NLM=0.25, (d) NLM=0.]

Fig. 4.2. Ranking vs. total time for the mesh optimization algorithm when used in conjunction with the various
vertex reordering schemes with various NLM values in Experiment 1.


Figure 4.2 shows the ranking results for each scheme in Experiment 1. The overall
scheme rankings were relatively stable although the rankings for some schemes, such as
GRVM, -LNG, and -D-GAVM, changed slightly for NLM=0. The mesh optimization algorithm
converged faster, independent of the NLM value, when most of the schemes other than
N were used for reordering. Thus, vertex reordering improves the efficiency of the mesh
optimization algorithm.

In this experiment, the particular choice of reordering scheme is important. It was observed
that the time difference between the best and the worst ranked schemes was reasonably
large. The timing results were: 72.07±43.57 seconds for NLM=0.75, 86.86±52.42 seconds
for NLM=0.5, 101.64±61.27 seconds for NLM=0.25, and 2177.24±1225.05 seconds for NLM=0.
That is, for NLM=0.75, 0.5, 0.25, and 0, the percent variations in total time were 60.45%,
60.26%, 60.29%, and 56.27%, respectively. The reason for the large variation in the total
time was the relatively poor performance of scheme N. The total time of scheme N (3100
seconds) was approximately twice that of the other reordering schemes (1520 seconds).
Excluding this outlier, the percent variation in the total time was 35%, which was still
significant. Therefore, appropriate selection of the reordering scheme was required to
converge quickly.

Similarly, the deviation in $\beta$ (4.1) was relatively noticeable. Figure 4.3 shows the values
of $\beta$ for each scheme in Experiment 1.


[Figure: $\beta$ vs. NLM panels for (a) forward schemes, (b) reverse schemes.]

Fig. 4.3. $\beta$ vs. NLM for the 20 vertex reordering schemes in Experiment 1. Results for the 10 forward schemes
N, R, WQP, D-WQP, GAVM, D-GAVM, GRVM, D-GRVM, LNG, and D-LNG are shown in (a). The results for the 10
reverse schemes -N, -R, -WQP, -D-WQP, -GAVM, -D-GAVM, -GRVM, -D-GRVM, -LNG, and -D-LNG are shown in
(b).


In all cases, the deviations in $\beta$ were greater than 1. In particular, the deviations were:
1.39 for NLM=0.75, 1.37 for NLM=0.5, 1.36 for NLM=0.25, and 1.16 for NLM=0. This
was because the value of $\beta$ for scheme N was always greater than 1. In addition, the total time
for scheme N was twice that of the other schemes. Again, this demonstrated the importance
of the choice of vertex reordering scheme.

The -D-WQP scheme ranked the best in this experiment for all NLM values. The GRVM,
D-LNG, and D-GAVM schemes also showed relatively good performance in this experiment.
Although small changes occurred in the relative rankings as the NLM value decreased, the
total time to converge was reduced through an appropriate choice of reordering scheme,
especially the -D-WQP scheme for this experiment.

4.2. Experiment 2: Mesh Size. The purpose of this experiment was to determine the
effect that mesh size has on the efficiency of the mesh optimization algorithm when used
in conjunction with various vertex reordering schemes. In this experiment, five different
sizes of meshes were used: 50K, 75K, 100K, 125K, and 150K. Appropriate mesh sizes were
determined through a preliminary experiment. Figure 4.4 shows the ranking vs. total time
results for each of the five mesh sizes in this experiment.

The rankings for each scheme were very sensitive to an increase in mesh size. Schemes
WQP and -WQP ranked very highly until the mesh size increased to 150K. However, when
the mesh size reached 150K, these schemes ranked the worst. The time difference between
the best and the worst ranked schemes was relatively noticeable. For mesh sizes 50K, 75K,
100K, 125K, and 150K, the timing results were: 72±20 seconds, 170±78 seconds, 600±305
seconds, 1622±853 seconds, and 2532±804 seconds, respectively. The percent variations
in the total time were: 27% for 50K, 47% for 75K, 51% for 100K, 53% for 125K, and
32% for 150K. Excluding the outlier, i.e., the 150K mesh, the time variations increased as
the mesh size increased. In this experiment, quality-related reordering schemes WQP and
-WQP showed good performance when combined with the mesh optimization algorithm.
The deviation in $\beta$ (4.1) also showed the sensitivity of the ranking results as the mesh size
increased (Figure 4.5).

[Figure: ranking vs. time panels (NLM=0) for (a) 50K, (b) 75K, (c) 100K, (d) 125K.]

Fig. 4.4. Ranking vs. total time for the mesh optimization algorithm when used in conjunction with the various
vertex reordering schemes with various mesh sizes in Experiment 2.

The deviations in $\beta$ were: 0.28, 0.65, 0.53, 0.97, and 0.51 for mesh sizes 50K through
150K, respectively. Although the deviation in $\beta$ for mesh size 50K was small compared to
the deviation for the other mesh sizes, the overall $\beta$ deviation from one mesh size to the next
was discernible.


[Figure: $\beta$ vs. mesh size panels for (a) forward schemes, (b) reverse schemes.]

Fig. 4.5. $\beta$ vs. mesh size for the 20 vertex reordering schemes in Experiment 2. Results for the 10 forward
schemes N, R, WQP, D-WQP, GAVM, D-GAVM, GRVM, D-GRVM, LNG, and D-LNG are shown in (a). The results
for the 10 reverse schemes -N, -R, -WQP, -D-WQP, -GAVM, -D-GAVM, -GRVM, -D-GRVM, -LNG, and -D-LNG are
shown in (b).

4.3. Experiment 3: Sensitivity to Initial Mesh Vertex Coordinates. The purpose of this
experiment was to investigate the behavior of the mesh optimization algorithm when used in
conjunction with various reordering schemes when various meshes with different initial mesh
qualities were used. The input meshes differed from the original mesh used in Experiment
1 in their vertex coordinates. Four meshes of various qualities were obtained by randomly
perturbing the vertex coordinates of the original mesh as follows:

$$x_{new} = x_{orig} + f L u, \tag{4.2}$$

where $u$ is a random unit vector, $L$ is the average of all the edge lengths in the mesh, and $f$
is the amount of perturbation. In this paper, $f$ = 0, 0.1, 0.3, and 0.9. Note that $f = 0$ means
no perturbation, and $f = 0.9$ is a large perturbation. The perturbations are applied only to
the interior vertices in the mesh. Figure 4.6 shows the ranking results for each scheme in
Experiment 3.
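
The perturbation in (4.2) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the experiments: the helper names, the tiny five-vertex test mesh, and the choice to draw an independent unit vector for each interior vertex are all assumptions.

```python
import math
import random

def random_unit_vector(rng):
    """Random 3D direction obtained by normalizing a Gaussian sample."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 1e-12:
            return tuple(c / norm for c in v)

def average_edge_length(vertices, edges):
    """L in (4.2): the mean length over all mesh edges."""
    return sum(math.dist(vertices[a], vertices[b]) for a, b in edges) / len(edges)

def perturb_interior(vertices, edges, interior, f, rng):
    """Apply x_new = x_orig + f * L * u to the interior vertices only."""
    L = average_edge_length(vertices, edges)
    new_vertices = list(vertices)
    for i in interior:
        u = random_unit_vector(rng)  # assumed: one independent direction per vertex
        new_vertices[i] = tuple(x + f * L * c for x, c in zip(vertices[i], u))
    return new_vertices

# Hypothetical mesh: a tetrahedron with one interior vertex (index 4)
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
            (0.25, 0.25, 0.25)]
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (0, 4), (1, 4), (2, 4), (3, 4)]
L = average_edge_length(vertices, edges)
perturbed = perturb_interior(vertices, edges, interior=[4], f=0.3, rng=random.Random(0))
unperturbed = perturb_interior(vertices, edges, interior=[4], f=0.0, rng=random.Random(0))
```

Each interior vertex moves by exactly $fL$, so $f$ directly controls how far the input mesh sits from the original one, measured in units of the average edge length.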

As the perturbation size increased, the mesh optimization algorithm required more time
to converge for nearly all the reordering schemes. The time difference between the best and
the worst ranked schemes was noticeable. The timing results were: 1940±1148 seconds for
$f = 0$, 1440±712 seconds for $f = 0.1$, 2013±985 seconds for $f = 0.3$, and 3775±2270
seconds for $f = 0.9$. That is, for $f$ = 0, 0.1, 0.3, and 0.9, the percent variations in total
time were 59%, 50%, 49%, and 60%, respectively. Hence vertex reordering can reduce the
amount of time for the mesh optimization algorithm to converge. In this experiment, schemes
D-GAVM for $f = 0$, WQP for $f = 0.1$, -N for $f = 0.3$, and -D-GRVM for $f = 0.9$ ranked the
best. The mesh optimization algorithms with these schemes took less than half of the total
time for the mesh optimization algorithm with scheme N.

The plots of the values of $\beta$ vs. perturbation size for each scheme are shown in Figure
4.7. The $\beta$ deviations (4.1) were: 0.85 for $f = 0$, 0.51 for $f = 0.1$, 0.74 for $f = 0.3$, and 0.7
for $f = 0.9$. In every case the deviation from $\beta = 0$ exceeded 0.5. The ranking
of each scheme was very sensitive to the perturbation size.

Fig. 4.6. Ranking vs. total time for the mesh optimization algorithm when used in conjunction with the various
vertex reordering schemes with various perturbation sizes in Experiment 3.

[Figure: $\beta$ vs. perturbation size panels for (a) forward schemes, (b) reverse schemes.]

Fig. 4.7. $\beta$ vs. perturbation size for the 20 vertex reordering schemes in Experiment 3. Results for the 10
forward schemes N, R, WQP, D-WQP, GAVM, D-GAVM, GRVM, D-GRVM, LNG, and D-LNG are shown in (a). The
results for the 10 reverse schemes -N, -R, -WQP, -D-WQP, -GAVM, -D-GAVM, -GRVM, -D-GRVM, -LNG, and -D-LNG
are shown in (b).

4.4. Experiment 4: Sensitivity to Initial Mesh Vertex Ordering. The purpose of this
experiment was to explore the effect that the initial vertex ordering had on the efficiency of the
mesh optimization algorithm when used in conjunction with the vertex reordering schemes.
The input meshes differed from the original mesh used in Experiment 1 in their initial vertex
orderings. Four different reordering schemes, i.e., LNG, R, N, and D-LNG, were applied to
the original mesh to obtain four initial meshes. The results for this experiment are shown in
Figure 4.8.

When schemes LNG and R were used for creating the initial vertex ordering, the mesh
optimization algorithm with scheme -D-WQP required the least amount of time to complete.
However, the rankings for all of the schemes were very sensitive to the vertex ordering of the
initial mesh. In this experiment, the time difference between the best
and the worst ranked schemes was relatively large, i.e., over 2000 seconds. For the initial
vertex orderings created using schemes LNG, R, N, and D-LNG, the timing results were:
2173±1183 seconds, 2006±1055 seconds, 2177±1255 seconds, and 1920±1084 seconds,
respectively. That is, the percent variations in total time were: 54% for the LNG initial ordering,
53% for the R initial ordering, 56% for the N initial ordering, and 56% for the D-LNG initial
ordering. Figure 4.9 also shows the sensitivity of the ranking results for this experiment.

[Figure: ranking vs. time panels (NLM=0) for initial orderings (a) scheme LNG, (b) scheme R, (c) scheme N, (d) scheme D-LNG.]

Fig. 4.8. Ranking vs. total time for the mesh optimization algorithm when used in conjunction with the various
vertex reordering schemes applied to meshes with different initial vertex orderings in Experiment 4.


[Figure: $\beta$ vs. initial ordering panels for (a) forward schemes, (b) reverse schemes.]

Fig. 4.9. $\beta$ vs. the initial vertex ordering for the 20 vertex reordering schemes in Experiment 4. Results for the
10 forward schemes N, R, WQP, D-WQP, GAVM, D-GAVM, GRVM, D-GRVM, LNG, and D-LNG are shown in (a).
The results for the 10 reverse schemes -N, -R, -WQP, -D-WQP, -GAVM, -D-GAVM, -GRVM, -D-GRVM, -LNG, and
-D-LNG are shown in (b).

The deviation in $\beta$ (4.1) as a function of initial vertex ordering was very significant. The
maximum value of $\beta$ was: 1.33 for the LNG initial ordering, 1.11 for the R initial ordering,
1.16 for the N initial ordering, and 1.06 for the D-LNG initial ordering. Even scheme N had
the maximum value of $\beta$ for the N and D-LNG initial vertex orderings.

5. Conclusions. Numerical experiments were conducted on a local-patch
Laplacian-smoothing algorithm to assess the effect of the ordering of the vertices in a tetrahedral
mesh on the CPU time to solution. Twenty orderings, based on mesh quality, gradients,
and movement, were created from the initial mesh for use in a set of four experiments. The
experiments were designed to explore the sensitivity of the timing results to (1) whether or
not an 'accurate' solution was found, (2) the mesh size, (3) the 'distance' between the initial
and optimal meshes, and (4) the initial vertex orderings. The results reveal that the
time-to-solution within each experiment varied significantly (about 50%) over the twenty orderings,
showing that it does matter which ordering is used. Scheme N results in the initial ordering
that is supplied by the mesh generator; that scheme ranked 20th (worst) in Experiment 1, no
better than 14th in Experiment 2 (except for the 150K mesh), and no better than 10th in
Experiment 3, leading to the conclusion that re-ordering is often better than using the initial
ordering from the mesh generator.

In the previous paper [12], one of the conclusions was that the farther the initial guess
is from optimal, the larger the variation in the total time of the mesh optimization algorithm,
and thus the more significant the choice of reordering scheme becomes. A similar trend was
obtained in the results of Experiment 3. When the perturbation size $f = 0.9$ was applied, the
timing results showed more variation than in the cases with smaller perturbation sizes. However,
Experiment 4 did not show that vertex reorderings were most useful when the initial vertex
ordering was poor.

In Experiment 1, the rankings of the schemes were remarkably stable as the 'accuracy'
of the solution was varied, with -D-WQP giving the smallest timings. Rankings were notably
less stable in the other experiments. Scheme WQP ranked well in the mesh-size experiments
(from 50K to 125K vertices); however, scheme -D-WQP, which ranked well in Experiment 1,
did not stand out in Experiment 2. Scheme GAVM ranked well in Experiment 3 for $f < 0.9$.
Scheme -D-WQP again ranked well in Experiment 4. Consequently, although no particular
scheme did well under all circumstances, the aforementioned schemes might be worth
considering if static re-ordering is to be applied prior to local-patch Laplacian-smoothing.

6. Future Work. Reducing the variability in the multicore timing results is planned in
the future. Furthermore, additional experiments will be performed to examine the performance
of the reordering schemes within other contexts such as aspect ratio and boundary
vertex reordering.

MULTIFRACTAL DIMENSIONS USING MAXIMAL SIMPLICES
AND  PYTHON  EXTENSIONS  TO  TEVA-SPOT 

JESSE  BERWALD*,  DAVID  DAY’,  SCOTT  MITCHELL*,  AND  AFRA  ZOMORODIAN§ 

Abstract.  This  article  briefly  summarizes  two  projects,  the  first  on  nonlinear  dynamical  systems  and  the  second 
on  discrete  optimization  software. 

Nonlinear  dynamical  systems,  which  are  integral  to  a  wide  range  of  fields  from  biological  modeling  to  climate 
studies,  often  exhibit  chaotic  behavior.  The  attractors  of  such  systems  give  a  window  into  the  global  behavior  of  the 
systems  themselves.  We  introduce  a  new  method  for  determining  the  multifractal  spectrum  and  dimension  of  the 
attractor  of  a  dynamical  system.  We  show  that  the  tools  developed  for  efficient  analysis  in  the  field  of  computational 
topology  are  well-suited  for  the  task  of  estimating  the  local  density  of  points  in  the  attractor. 

Discrete optimization problems are found in a great variety of fields, from architecture (e.g., long term site
plans over many sites) to water planning and management to nuclear weapon stewardship. We describe below an
extension to the TEVA-SPOT suite developed by SNL and the EPA. We focus on the Pareto front of optimal solutions
to a pair of objectives and consider the near-optimal solutions dominated by these.


1. Introduction: Simplicial Measures. The discovery of strange attractors, such as
the Lorenz attractor and other chaotic systems, illuminated an incredibly rich structure in
relatively simple physical systems. Subsequent study showed that the attractors of such systems
are complex enough to have a continuum of scaling exponents [6, 12, 18]. This is in contrast
to the Cantor set, for instance, which has a single scaling exponent [6].

The concept of dimension can be generalized past the usual notions of length, area, and
volume. Many methods have been developed to approximate the dimension of the attractor
of a dynamical system. The two that we will focus on in this work are the box-counting
dimension, also known as the Minkowski-Bouligand dimension, and the Hausdorff dimension,
which is estimated using the partition function and methods derived from thermodynamics
[6, 14, 17].

In  order  to  approximate  the  dimension  of  an  attractor  computationally,  one  needs  a  way  to 
tease  out  the  invariant  measure  on  the  attractor  from  a  finite  set  of  data  points  [5].  These  data 
are  assumed  to  have  been  sampled  from  the  distribution  of  points  on  the  attractor  (determined 
by  the  invariant  measure),  either  numerically  by  running  a  computer  simulation  for  a  “long 
time”  or  through  suitable  measurement  of  a  real  physical  system.  Understanding  the  invariant 
measure  is  difficult  because  it  often  has  a  highly  irregular  distribution  on  the  attractor  and  can 
only  be  studied  using  the  finite  data  set. 

We  describe  a  new  simplicial  measure  that  gives  an  accurate  approximation  of  the  mass 
distribution  on  the  attractor  and  allows  one  to  compute  the  fractal  dimension  as  well  as  the 
multifractal  spectrum  of  the  attractor.  A  simplicial  measure  is  defined  on  a  subset  of  simplices 
in  a  Vietoris-Rips  simplicial  complex  that  has  been  constructed  from  a  finite  data  set  sampled 
from  the  attractor.  This  subset  is  composed  of  the  maximal  simplices  in  the  complex.  These 
structures  provide  detailed  information  about  the  density  of  points  in  the  data  set,  allowing 
one  to  approximate  the  invariant  measure  on  the  attractor  itself. 

1.1.  Dynamical  Systems.  In  this  section  we  recall  the  definition  of  a  dynamical  system 
as  well  as  that  of  an  attractor.  A  dynamical  system  specifies  a  rule  that  describes  how  one 
state  of  a  system  evolves  into  another  over  time.  We  describe  this  situation  for  discrete-time 
systems on a topological space $X$, the state space. (For a more thorough review including
continuous time systems see [10, 18].) Given a map $f : X \to X$ the evolution of the system
is defined by $x_{t+1} = f(x_t)$, where $x_t := f^t(x)$ and $t \in \mathbb{N}$ or $\mathbb{Z}$. An attractor is a set $A \subset X$ to
which a dynamical system evolves over time. We state this precisely as follows: A compact
set $A \subset X$ is an attractor for $f$ if there exists a neighborhood $V$ of $A$ and $N \in \mathbb{N}$ such that
$f^N(V) \subset V$ and $A = \bigcap_{n \in \mathbb{N}} f^n(V)$. We illustrate the above with an example.

1.1.1. Example (Cantor set). The Cantor set, $F$, is the prototypical example of an attractor
of a dynamical system that is also fractal. We describe a construction of the Cantor
set for which $F$ is the attractor of an iterated function system (IFS) [6] defined in terms of a
simple dynamical system. Define the map $f : \mathbb{R} \to \mathbb{R}$ by

$$f(x) = \tfrac{3}{2}\,(1 - |2x - 1|). \tag{1.1}$$

This is often referred to as a tent map; see Figure 1.1.

Fig. 1.1. The tent map defined by $f$. Intersection of the iterations of the branches of the inverse, $f_0$ and $f_1$, give $F$.

Now define an IFS by the contractions $f_0, f_1 : [0, 1] \to [0, 1]$ such that

$$f_0(x) = \frac{x}{3}, \qquad f_1(x) = 1 - \frac{x}{3}. \tag{1.2}$$

The equations (1.2) are the two branches of $f^{-1}$ since

$$f(f_0(x)) = x = f(f_1(x)). \tag{1.3}$$

Let $E = [0, 1]$ and define $g(V) := f_0(V) \cup f_1(V)$, where $V$ is a nonempty compact subset
of $E$. Using the Contraction Mapping Theorem we can conclude that there exists a unique,
nonempty, compact attractor in $E$ corresponding to the Cantor set. We can define $F$ in terms
of the IFS such that $F = \bigcap_{k=0}^{\infty} g^k(E)$. From (1.3) we see that $f(F) = F$.

We detail some of $f$'s behavior on $F$. If $x \in \mathbb{R} \setminus E$, then $f^k(x)$ diverges to $-\infty$ as $k \to \infty$.
Similarly, if $x \in E \setminus F$, then there exists some $N > 0$ such that
$x \notin \bigcup \{f_{i_0} \circ f_{i_1} \circ \cdots \circ f_{i_N}(E)\}$,
where $i_j \in \{0, 1\}$. Therefore, running the system forwards in time we see that $f^N(x) \notin F$. This
shows that $F$ is a repellor.
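
The construction can be checked numerically with exact rational arithmetic. The sketch below assumes the tent map $f(x) = \tfrac{3}{2}(1 - |2x - 1|)$ with inverse branches $f_0(x) = x/3$ and $f_1(x) = 1 - x/3$; the interval representation and the helper names `g` and `level` are illustrative choices, not part of the original text.

```python
from fractions import Fraction

def f(x):
    """Tent map: f(x) = (3/2) * (1 - |2x - 1|)."""
    return Fraction(3, 2) * (1 - abs(2 * x - 1))

def f0(x):
    """Left inverse branch of f on [0, 1]."""
    return x / 3

def f1(x):
    """Right inverse branch of f on [0, 1]."""
    return 1 - x / 3

def g(intervals):
    """Apply g(V) = f0(V) U f1(V) to a list of closed intervals (a, b)."""
    images = [(f0(a), f0(b)) for a, b in intervals]
    images += [(f1(b), f1(a)) for a, b in intervals]  # f1 reverses orientation
    return sorted(images)

def level(k):
    """g^k(E) for E = [0, 1]: the k-th level of the Cantor set construction."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(k):
        intervals = g(intervals)
    return intervals

# A point of E \ F (the middle third) escapes E after one iteration,
# and its orbit then decreases without bound.
orbit = [Fraction(1, 2)]
for _ in range(4):
    orbit.append(f(orbit[-1]))

level1 = level(1)
level3 = level(3)
```

Because the sets $g^k(E)$ are nested, `level(k)` is also the intersection of the first $k$ iterates, so `level(3)` consists of the eight intervals of length $3^{-3}$ that make up the third level.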


To aid in the discussion in Section 1.3 we summarize the symbolic dynamics viewpoint
of $f$. The sequence of inverses provides a coding of the points in $F$ as follows: Let $x \in F$, then

$$x = x_{i_1 i_2 \ldots} = \bigcap_{l=0}^{\infty} f_{i_1} \circ f_{i_2} \circ \cdots \circ f_{i_l}(E). \tag{1.4}$$

Note that this provides an alternative way to write $F$, namely $F = \bigcup \{x_{i_1 i_2 \ldots}\}$. Now, it follows
from (1.4) that $|x_{i_1 i_2 \ldots} - x_{i'_1 i'_2 \ldots}| < 3^{-k}$ whenever $i_1 = i'_1, \ldots, i_k = i'_k$. Applying $f$ to a point in $F$
we see that $f(x_{i_1 i_2 \ldots}) = x_{i_2 i_3 \ldots}$. To see that orbits in $F$ are dense under $f$, consider that for any
$x = x_{i_1 i_2 \ldots} \in F$, there is a $k$ such that $i_1 = i_{k+1}, i_2 = i_{k+2}, \ldots, i_l = i_{k+l}$, from which it follows
that $|x - f^k(x)| < 3^{-l}$. From this it follows that periodic points are dense in $F$ as well.

Points in $F$ also exhibit sensitivity to initial conditions, showing that $F$ is a chaotic
repellor for $f$. Alternatively, it is a chaotic attractor for the IFS defined by $g$. To see the
former, consider $x = x_{i_1 \ldots i_k 0 \ldots}$ and $x' = x_{i_1 \ldots i_k 1 \ldots}$, which are within $3^{-k}$ of one another. Then
$f^k(x) = x_{0 \ldots} \in [0, 1/3]$ and $f^k(x') = x_{1 \ldots} \in [2/3, 1]$. Hence, points that begin near one another
can diverge under the action of $f$.
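
The coding and the sensitivity argument can be made concrete with a short sketch. It assumes the tent map $f(x) = \tfrac{3}{2}(1 - |2x - 1|)$ and the branches $f_0(x) = x/3$, $f_1(x) = 1 - x/3$. Finite codes are used in place of infinite ones by terminating each code at the fixed point $3/4$ of $f_1$, which stands in for the tail $111\ldots$; the helper names and the particular prefix are hypothetical.

```python
from fractions import Fraction

def f(x):
    """Tent map: f(x) = (3/2) * (1 - |2x - 1|)."""
    return Fraction(3, 2) * (1 - abs(2 * x - 1))

# Inverse branches f_0 and f_1, indexed by symbol
BRANCHES = (lambda x: x / 3, lambda x: 1 - x / 3)

def point_from_code(code, base=Fraction(3, 4)):
    """x = f_{i1}(f_{i2}(...f_{ik}(base))) for a finite code (i1, ..., ik).

    base = 3/4 is fixed by f_1, so the point has the infinite code i1...ik 111...
    """
    x = base
    for symbol in reversed(code):
        x = BRANCHES[symbol](x)
    return x

def iterate(x, k):
    """Apply f to x a total of k times."""
    for _ in range(k):
        x = f(x)
    return x

# Applying f drops the first symbol of the code (the shift map)
code = (0, 1, 1, 0)
shifted = f(point_from_code(code))

# Two points sharing a prefix of length k start within 3^-k of one another ...
prefix = (0, 1, 1, 0, 1)
k = len(prefix)
x = point_from_code(prefix + (0,))
x_prime = point_from_code(prefix + (1,))
initial_gap = abs(x - x_prime)

# ... but after k iterations they land in different thirds of [0, 1]
fx, fx_prime = iterate(x, k), iterate(x_prime, k)
```

Here the initial gap is $\tfrac{1}{2} \cdot 3^{-5}$, while after five iterations the two points sit at $1/4$ and $3/4$.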


Fig. 1.2. The ternary Cantor set showing the concept of self-similar, or fractal, structure. Shown is the third
"level", i.e. F was iterated 3 times.


We compute the first few iterations and intersections of $g$, namely $\bigcap_{k=0}^{3} g^k([0, 1])$, in
Figure 1.2. This is the "third level" of the Cantor set. We will return to the Cantor set below.

1.2. Simplices, Simplicial Complexes, and Simplicial Sets. The simplicial measure
that we develop has its origins in computational topology. We therefore continue by describing
some structures basic to topology and homology in particular. A more thorough
description can be found in [13].

A simplex is a geometric object that generalizes the notion of a triangle. A triangle is
formed from three vertices all of which are connected to form a 2-dimensional convex hull.
In general, an $n$-dimensional simplex is a polytope formed by the convex hull of $n+1$ vertices.
We denote a simplex $\sigma$ by its ordered set of vertices $[v_0, v_1, \ldots, v_n]$ or simply $v_0 v_1 \ldots v_n$ for
brevity. The dimension of $\sigma$ we denote by $\dim(\sigma)$. A face $\tau$ of an $n$-dimensional simplex
$\sigma$ is also a simplex and is defined as the convex hull formed by a (non-empty) subset of the
$n + 1$ vertices of $\sigma$. A simplicial complex is a collection $K$ of simplices such that for each
$\sigma_1, \sigma_2 \in K$, $\sigma_1 \cap \sigma_2$ is a face of both or empty; and any face of a simplex in $K$ is also a
simplex in $K$.

Given $\tau$, a face of $\sigma$, $\sigma$ is a coface of $\tau$. A maximal simplex is a simplex with no proper
coface in $K$.

We now describe the Vietoris-Rips (VR) complex. We begin with the Vietoris-Rips neighborhood
graph. Given a finite set of points $Y \subset \mathbb{R}^m$ and a real value $\epsilon > 0$, the VR neighborhood
graph, $\mathcal{N}_\epsilon = \mathcal{N}_\epsilon(Y)$, is the neighborhood graph composed of nodes (points) from $Y$,
and edges whenever $d(x, y) < \epsilon$ for $x, y \in Y$, where $d$ is the Euclidean metric. We can expand
a neighborhood graph to a VR complex as follows. Let $\mathcal{N}_\epsilon = (V, E)$, where $V$ and $E$ are the
nodes and edges of $\mathcal{N}_\epsilon$, respectively. When all edges of $\sigma$, denoted by $e(\sigma)$, are in $\mathcal{N}_\epsilon$ we
append $\sigma$ to $\mathcal{N}_\epsilon$ to obtain the VR complex

$$\mathcal{R}_\epsilon(Y) := V \cup E \cup \{\sigma \mid e(\sigma) \subset E\}. \tag{1.5}$$

The last term simply states that if all edges of $\sigma$ are in $\mathcal{N}_\epsilon$ then $\sigma$ belongs to $\mathcal{R}_\epsilon(Y)$. In other
words, $\mathcal{R}_\epsilon(Y)$ is composed of simplices whose vertices are each within $\epsilon$ of every other vertex
in the simplex.
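
The expansion from neighborhood graph to VR complex can be illustrated with a brute-force sketch. This is not the fast construction cited in the text; it simply tests every vertex subset up to a chosen dimension, which is only practical for tiny point sets. The function names, the `max_dim` parameter, and the sample points are assumptions made for the example.

```python
import math
from itertools import combinations

def vr_neighborhood_graph(points, eps):
    """Edges (i, j), i < j, between points at Euclidean distance less than eps."""
    n = len(points)
    return {(i, j) for i, j in combinations(range(n), 2)
            if math.dist(points[i], points[j]) < eps}

def vr_complex(points, eps, max_dim=2):
    """All simplices (as sorted vertex tuples) up to dimension max_dim.

    A vertex set is a simplex exactly when all of its edges are in the graph.
    """
    edges = vr_neighborhood_graph(points, eps)
    simplices = [(i,) for i in range(len(points))] + sorted(edges)
    for size in range(3, max_dim + 2):  # size vertices -> (size - 1)-simplex
        for sigma in combinations(range(len(points)), size):
            if all(pair in edges for pair in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices

# Hypothetical planar point set: three nearby points and one far away
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (3.0, 0.0)]
graph = vr_neighborhood_graph(points, 1.2)
complex_large = vr_complex(points, 1.2)   # contains the triangle (0, 1, 2)
complex_small = vr_complex(points, 0.95)  # edge (0, 1) is missing, so no triangle
```

Lowering the threshold from 1.2 to 0.95 removes the edge between the first two points, and the 2-simplex disappears with it.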

The VR complex is a critical tool for computational topologists as it represents the topology
of a point set and is relatively fast to compute [19]. Hence, in the computational setting
we utilize VR complexes exclusively. Nevertheless, to obtain the theoretical results below the
Cech complex is more suitable. It is constructed similarly to a VR complex [3, 9]. Given $\epsilon$, the
Cech complex $\mathcal{C}_\epsilon(Y)$ contains a simplex for every subset of balls of radius $\epsilon$ with nonempty
intersection. It is easy to see that given $\epsilon'$ and $\epsilon = 2\epsilon'$, $\mathcal{R}_{\epsilon'}(Y) \subset \mathcal{C}_\epsilon(Y)$. The relationship
between $\epsilon$ and $\epsilon'$ can be made tighter, a fact which we formalize in Section 1.3.


(a)  VR  complex 


(b)  Cech  complex 


Fig.  1.3.  (a)  The  balls  forming  the  VR  complex  have  radius  e.  Since  all  edges  are  in  the  graph,  the  2-simplex 
(the  triangle )  is  an  element  of  the  VR  complex,  (b)  The  Cech  complex  results  from  balls  of  radius  -^r.  In  this  case 

the  balls  have  a  nonempty  intersection,  so  the  2-simplex  defined  by  the  three  nodes  is  included  in  the  Cech  complex. 


Lastly, the $k$-skeleton of a simplicial complex $K$ is the subcomplex of $K$ having faces of
dimension no larger than $k$. The 1-skeleton of $\mathcal{R}_\epsilon(Y)$ is just the neighborhood graph with a
threshold of $\epsilon$ consisting of nodes (vertices) and edges between nodes that are less than $\epsilon$
apart. Recall that a clique is a set of nodes in a graph that induce a complete subgraph. If
a clique cannot be made any larger then it is termed maximal. The maximal cliques in the
1-skeleton of a VR complex are the maximal simplices of the complex.
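
The correspondence between maximal cliques and maximal simplices can be illustrated with a short Python sketch. It builds the neighborhood graph of a small point set and enumerates its maximal cliques with a basic Bron-Kerbosch recursion. The function names and the sample points are hypothetical, and this is not the fast construction of [19] used in this work; it is only meant to make the definitions concrete.

```python
from itertools import combinations
from math import dist

def neighborhood_graph(points, eps):
    """Adjacency sets of the VR neighborhood graph: edge iff d(x, y) < eps."""
    adj = {i: set() for i in range(len(points))}
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) < eps:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def maximal_simplices(adj):
    """Maximal cliques of the graph (basic Bron-Kerbosch, no pivoting)."""
    found = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            found.append(frozenset(clique))  # clique cannot be extended
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return found

# Hypothetical sample: three mutually close points and one isolated point.
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.8), (3.0, 0.0)]
simplices = maximal_simplices(neighborhood_graph(points, eps=1.2))
```

Here the three nearby points form a single maximal 2-simplex, while the isolated point is a maximal 0-simplex on its own.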

1.3.  Simplicial  Measures.  Recall  that  we  are  interested  in  using  the  maximal  simplices 
in  a  VR  complex  to  understand  the  density  of  points  in  an  attractor.  We  first  describe  the 
fundamental  notions  behind  the  definition  of  the  simplicial  measure.  We  then  define  the 
simplicial  measure. 

Let $f$ be a diffeomorphism of the manifold $X$. Suppose $U$ is a neighborhood of the attractor
$A$ of $f$, which satisfies Smale's Axiom A. Namely, we require i) that the non-wandering
points of $f$, $\Omega(f)$, form a hyperbolic set; and ii) that the periodic points of $f$ are dense in $\Omega(f)$.
Given these assumptions, then for all $x$ in $U$ there exists a probability measure $\mu$, called the
SRB measure, such that [16],

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \delta_{f^k(x)} = \mu, \tag{1.6}$$

where

$$\delta_y(B) = \begin{cases} 1 & \text{if } y \in B \\ 0 & \text{if } y \notin B. \end{cases} \tag{1.7}$$


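The limit in (1.6) is taken on the set $\mathcal{B}(X)$ of all Borel probability measures on $X$, which
is given the weak* topology. Let $\mu_n = \frac{1}{n} \sum_{k=0}^{n-1} \delta_{f^k(x)}$ be a measure in $\mathcal{B}(X)$ and let $\phi$ be a
continuous function on $X$. In the weak* topology $\mu_n \to \mu$ iff $\int \phi \, d\mu_n \to \int \phi \, d\mu$ for every such $\phi$.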

We point out that general tent maps for which $|f'| > 1$, such as the one used to construct
the Cantor set in Section 1.1.1, satisfy the requirements of Axiom A systems. The nonwandering
set $\Omega(f) = F$ is hyperbolic. And as was shown, the orbits of $f$, and in particular the
periodic points, are dense in $F$.

In summary, the existence of the SRB measure in (1.6) holds for almost every $x$ in some
neighborhood $U$ of $A$ with positive Lebesgue measure. The measure $\mu$ is concentrated on
$A$ and is typically singular with respect to Lebesgue measure. In other words, the Lebesgue
measure of $A$ is zero while $\mu(A) = 1$. And the ergodic averages on the left hand side of (1.6)
converge to $\mu$.

In practice, we assume that we have a set of data points $Y \subset X$ generated by the observation
of a physical system or by a computer model. By (1.6), the ergodic averages

$$\frac{1}{n} \sum_{k=0}^{n-1} \delta_{f^k(x)} \tag{1.8}$$


approximate the measure $\mu$ on the attractor of the system. The length of the transient can pose
issues, and is system-dependent. Methods related to (1.8) have historically been used to analyze
both physical systems and computer-generated models. Most assume something
about the attractor: either its existence or that points in $Y$ tend toward it. Thus, while the SRB
measure theoretically provides a strong method of approximating the measure on an attractor,
in what follows we are forced to be less rigorous with respect to the actual measure that we
use. In other words, we do not assume that the measure we approximate is the unique SRB
measure.

Given $\epsilon > 0$, we construct a VR complex on the points in $Y$. It is convenient for the
theoretical development to use Cech complexes formed from balls of radius $\epsilon$. A theorem of
de Silva and Ghrist [3] provides the necessary relationship between VR complexes and Cech
complexes:

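Theorem 1.1. For a set of points $Y$ in $\mathbb{R}^d$, the Cech complex $C_\epsilon(Y)$ is bounded by VR
complexes as follows:

$$\mathcal{R}_{\epsilon'}(Y) \subset C_\epsilon(Y) \subset \mathcal{R}_\epsilon(Y) \tag{1.9}$$

whenever $\epsilon / \epsilon' \geq \sqrt{2d/(d+1)}$.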

We conclude from Theorem 1.1 that for each $\sigma \in \mathcal{R}_{\epsilon'}(Y)$, there exists a point $y \in A$ such
that $B_\epsilon(y)$ circumscribes the vertices of $\sigma$, where $\epsilon$ and $\epsilon'$ are related as in Theorem 1.1.
Figure 1.3 illustrates the non-extremal case, in which the vertices of the VR complex are each
less than a distance $\epsilon'$ from one another. The centroid falls inside the open neighborhood
determined by the intersecting balls in the Cech complex, as in Figure 1.3(b). In the extremal
case, where the vertices of $\mathcal{R}_{\epsilon'}(Y)$ are exactly pairwise $\epsilon'$ apart, the intersection of the balls in
a Cech complex is a single point, $y$.

In fact, $y$ is unique in both the extremal and non-extremal cases. To see this, define a map
from $C_\epsilon(Y)$ that chooses the centroid of each simplex,

$$\phi_\epsilon = \phi : C_\epsilon(Y) \to \mathbb{R}^d \tag{1.10}$$

given by $\phi(\sigma) = y$.

Claim 1.1. $\phi$ is one-to-one.
Proof. Let $\dim(\sigma) = n$. Define

$$f(x) := \max_{v \in [v_0, \ldots, v_n]} \|x - v\|, \tag{1.11}$$

which determines the minimal radius of a ball circumscribing the vertices of $\sigma$. If $\phi$ is not
injective, then there exist points $y \neq y'$ such that $f(y)^2 = f(y')^2 = \|v_i - y\|^2 = \|v_i - y'\|^2$ for each
vertex $v_i$. Relabeling $y' = y + \lambda v$, $\lambda > 0$ and $v \in \mathbb{R}^d$, we have that $\|v_i - y\|^2 = \|v_i - (y + \lambda v)\|^2$,
which holds iff $\lambda = 0$. □


Fig. 1.4. A simplex with an associated circumscribing ball B. The centroid is denoted with an x. The ball has
a residence measure of 3/N.


Therefore, for each simplex $\sigma \in \mathcal{R}_{\epsilon'}(Y)$ there exists a unique ball centered at $y = \phi(\sigma)$
with radius in $[\epsilon', \epsilon]$ that contains all of the vertices in $\sigma$. The simplices in $\mathcal{R}_{\epsilon'}(Y)$ determine
the points $y \in A$ which approximate the local density of points in $A$.

To bring this into the sphere of measure and dimension theory, we define the residence
(probability) measure on a set $U \subset X$ similarly to (1.6) by

$$\mu(U) = \lim_{m \to \infty} \frac{1}{m} \#\{k \mid 1 \leq k \leq m,\ f^k(x) \in U\} \tag{1.12}$$

for any $x \in X$. The measure computes the fraction of time that iterates of $f$ spend in $U$. When
using the finite set of data points, $Y = \{x_i\}_{i=1}^{N}$, we approximate $\mu$ in Equation (1.12) by
removing the limit. Thus, for a subset $U \subset Y$, let

$$\nu(U) = \frac{1}{N} \#\{k \mid 1 \leq k \leq N,\ f^k(x) \in U\} \tag{1.13}$$

$$= \frac{1}{N} \#\{x \mid x \in U\}. \tag{1.14}$$

Note that $\nu(Y) = 1$. The $\frac{1}{N}$ term will be left off to avoid clutter.
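
As a small illustration of (1.14), the following Python sketch computes the normalized count of sample points falling in a ball. The function name, the use of a closed ball, and the sample points are assumptions made for the example only; they are not taken from the implementation used in this work.

```python
from math import dist

def residence_measure(sample, center, radius):
    """Fraction of sample points lying within `radius` of `center`."""
    inside = sum(1 for x in sample if dist(x, center) <= radius)
    return inside / len(sample)

# Hypothetical planar sample standing in for the data set Y.
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
local = residence_measure(sample, center=(0.0, 0.0), radius=0.5)   # 3 of 4 points
total = residence_measure(sample, center=(0.0, 0.0), radius=10.0)  # whole sample
```

A ball that contains the whole sample has measure 1, in agreement with the normalization noted above.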

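Let $M_\epsilon(Y)$ be a collection of points in $\mathbb{R}^d$ determined by the centroids in $C_\epsilon(Y)$ so that

$$M_\epsilon(Y) := \{y \in X \mid \phi(\sigma) = y, \text{ for some } \sigma \in C_\epsilon(Y)\}. \tag{1.15}$$

For each $y$ in $M_\epsilon(Y)$ there is a radius $\delta \in [\epsilon', \epsilon]$ such that $\nu(B_\delta(y)) = k + 1$ where $k = \dim(\sigma)$
and $\phi(\sigma) = y$. Therefore, by abuse of notation, we can define the simplicial measure on
simplices of $\mathcal{R}_{\epsilon'}(Y)$ as

$$\nu(\sigma) = \nu[B_\delta(\phi(\sigma))] = \dim(\sigma) + 1, \tag{1.16}$$

where $\sigma \in \mathcal{R}_{\epsilon'}(Y)$.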

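We now formulate a well-defined map between VR complexes $\mathcal{R}_{\epsilon'}(Y)$ and centroids in
$M_\epsilon(Y)$. From the first inclusion in Theorem 1.1, if $\sigma_n = [v_0, \ldots, v_n]$ is an $n$-simplex in $\mathcal{R}_{\epsilon'}(Y)$,
then $[v_0, \ldots, v_n]$ is an $n$-simplex in $C_\epsilon(Y)$, where $\epsilon'$ and $\epsilon$ satisfy the condition of Theorem 1.1.
Hence, via the inclusion $\iota : \mathcal{R}_{\epsilon'}(Y) \hookrightarrow C_\epsilon(Y)$ we associate to each $\sigma_n \in \mathcal{R}_{\epsilon'}(Y)$ an $n$-simplex
in $C_\epsilon(Y)$. Thus, we define

$$g := \phi \circ \iota. \tag{1.17}$$

This identifies every $n$-simplex in $\mathcal{R}_{\epsilon'}(Y)$ with a point in $A$ for which there is a ball of radius
$\delta$ with measure $n + 1$.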

Since computation and simplex generation occur at the level of the VR complexes, the
diagram in (1.18) serves to connect VR complexes to centroids in $M_\epsilon(Y)$. The analysis above
shows how $\phi$ yields a well-defined map from the VR complex to the centroid set, and hence
to balls of radius at most $\epsilon$ whose measure is one more than the dimension of the
corresponding simplex in the complex. We summarize the above discussion in the following diagram:

$$\mathcal{R}_{\epsilon'}(Y) \xrightarrow{\ \iota\ } C_\epsilon(Y) \xrightarrow{\ \phi\ } M_\epsilon(Y), \qquad g = \phi \circ \iota. \tag{1.18}$$

We have dealt above with simplicial complexes containing all possible simplices. Since
every simplex contains many faces, and these faces all contain their own centroids, measures
of balls around points in $M_\epsilon(Y)$ would grossly overestimate the density of a region of space
according to the simplicial measure. Yet, computing the dimension of $A$ requires above all
else accuracy in the estimation of the density of a region. In the next section we briefly discuss
the reasons for this in the context of dimension theory.

1.4. Dimension. Fundamentally, determining the dimension of a set relies on “measuring”
how much space a set occupies at various scales. Thus, it is imperative that we are able
to faithfully estimate the density of a region of space using $\nu$. Hence, we must eliminate any
significant overlap amongst the simplices in $\mathcal{R}_\epsilon(Y)$. We detail in this section the heuristic
notions of “fractal dimension” in order to motivate our subsequent focus on the minimal set
of maximal simplices.

Fig. 1.5. In this small simplicial complex, the two maximal simplices are the filled (green) 2-simplex and
3-simplex. Since $\dim(\sigma) > \dim(\tau)$, $T_x = \{\sigma\}$. Symmetric reasoning holds for the 3-simplex on the right.

Suppose $M_\epsilon(Y)$ is some measurement of the set $Y$ at the scale $\epsilon$. As $\epsilon \to 0$ we consider
how $M_\epsilon(Y)$ behaves. A spatial dimension for $A$ can be approximated using the power law
relationship

$$M_\epsilon(Y) \sim \epsilon^{-\alpha}. \tag{1.19}$$

The scaling exponent $\alpha > 0$ is the dimension we seek. Taking the limit as $\epsilon \to 0$ we get

$$\alpha = \lim_{\epsilon \to 0} \frac{\log M_\epsilon(Y)}{-\log \epsilon}, \tag{1.20}$$

which lends itself nicely to numerical estimation of $\alpha$ by finding the slope of the linear
regression line through the plot of $\log M_\epsilon$ against $-\log \epsilon$. For example, suppose $M_\epsilon$ measures ordinary
Lebesgue area in $\mathbb{R}^2$, then $\alpha$ must be 2.
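
The regression in (1.20) can be sketched in a few lines of Python. The example below is an illustration under stated assumptions rather than the code used for the results in this paper: the function name is hypothetical, and the input is the idealized count of $2^k$ intervals of length $3^{-k}$ for the ordinary Cantor set.

```python
from math import log

def scaling_exponent(scales, measurements):
    """Least-squares slope of log(M_eps) against -log(eps)."""
    xs = [-log(e) for e in scales]
    ys = [log(m) for m in measurements]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Idealized Cantor set counts: 2^k covering sets of size 3^-k at level k.
levels = range(3, 11)
scales = [3.0 ** -k for k in levels]
counts = [2 ** k for k in levels]
alpha = scaling_exponent(scales, counts)  # slope approximates log(2)/log(3)
```

With measured data the points do not fall exactly on a line, so the fitted slope is only an estimate of $\alpha$ over the chosen range of scales.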

Sometimes, $\alpha$ depends on the location at which a measurement is taken. In this case one
obtains an entire spectrum of scaling exponents leading to the multifractal spectrum of a set.
We discuss these ideas in more detail in Section 1.7.


1.5. Pruning the Family of Maximal Simplices. A crucial step in utilizing the simplicial
measure to determine the dimension of a set is to make sure that the simplicial measure
accurately approximates the mass distribution of points at scale $\epsilon$. By considering the simplices
of maximal dimension associated to each $x \in Y$, we can exclude the measurement of
faces contained in these maximal simplices. We first show how to obtain these simplices from
$\mathcal{R}_\epsilon(Y)$ before we describe the computational aspects of this process.

For each $x \in Y$ let

$$S_x = \{\sigma \in \mathcal{R}_\epsilon(Y) \mid x \in \sigma\}. \tag{1.21}$$

By considering only the maximal simplices in $S_x$ we associate to each $x$ the set

$$T_x = \{\sigma' \in S_x \mid \dim(\sigma') \geq \dim(\sigma),\ \forall \sigma \in S_x\}. \tag{1.22}$$

Figure 1.5 shows an example for a small simplicial complex composed of 1-, 2-, and
3-simplices. The vertex $x$ belongs to both $\sigma$ and $\tau$, so $S_x = \{\sigma, \tau\}$. Yet $\dim(\sigma) > \dim(\tau)$ so only
$\sigma$ is included in $T_x$.
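
Definitions (1.21) and (1.22) can be made concrete with a short Python sketch. It assumes simplices are stored as frozensets of vertex labels, so that the dimension of a simplex is its size minus one; the function name and the labels are hypothetical.

```python
def top_simplices(simplices, x):
    """Return T_x: the simplices of largest dimension among those containing x."""
    s_x = [s for s in simplices if x in s]      # the set S_x of (1.21)
    if not s_x:
        return []
    top = max(len(s) for s in s_x)              # largest dim + 1 within S_x
    return [s for s in s_x if len(s) == top]    # the set T_x of (1.22)

# Hypothetical complex: x lies in a 2-simplex sigma and a 1-simplex tau.
sigma = frozenset({"x", "b", "c"})
tau = frozenset({"x", "a"})
complex_ = [sigma, tau]
t_x = top_simplices(complex_, "x")   # only sigma survives
t_a = top_simplices(complex_, "a")   # a belongs to tau alone
```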
Let $T$ be the family $\{T_x\}_{x \in Y}$ of maximal simplices for each point in $Y$. In practice,
the formation of $T$ is rather involved. Zomorodian has implemented fast algorithms for the
construction of the VR complex as well as the selection of maximal simplices from such a
complex (see [19, 20]). We utilize these to obtain $T$ for a VR complex $\mathcal{R}_\epsilon(Y)$.

For the example in Section 1.6.1, $T$ is composed of all sets in Figure 1.6. In general,
the generation of maximal simplices does not create disjoint simplices. Thus, even with $T$ in
hand, we must implement a second round of pruning.


Algorithm 5 Disjoint-Cover(T)

1: C ← ∅ // C: disjoint simplicial set
2: P ← ∅ // P: candidates for filling gaps
3: R ← ∅ // R: non-disjoint sets
4: while T ≠ ∅ do
5:     σ ← T.next() // remove largest simplex from T
6:     L ← σ \ C // non-intersecting vertices in σ
7:     if |L| = |σ| then [σ is disjoint from C]
8:         C ← C ∪ σ // add σ to cover
9:         C.vertices ← C.vertices ∪ σ.vertices
10:    else
11:        R ← R ∪ σ
12:    end if
13: end while
14: make-hash(R, P)
15: return C, P


1.6. The Tidy Cover Algorithm. Recall that a collection of sets $\{V_\alpha\}_{\alpha \in J}$ is said to cover
a set $Z$ if

$$\bigcup_{\alpha \in J} V_\alpha \supset Z, \tag{1.23}$$

where $J$ is an index set. We define a simplicial cover to be a cover of the set $Y$ by a collection
of simplices from $\mathcal{R}_\epsilon(Y)$. We use the algorithms Disjoint-cover and Gap-cover to pare down
$T$ to a subcollection of simplices that is close to the minimal number necessary to cover
$Y$. We term this collection a tidy cover. Together we call Disjoint-cover and Gap-cover the
tidy-cover algorithm.


Algorithm 6 make-hash(R, P)

1: for σ ∈ R do
2:     m ← |L|
3:     σ.complement ← L // add complement vertices as simplex attribute
4:     P[m] ← P[m] ∪ σ // add σ to P, indexed by m
5: end for


Before calling Disjoint-Cover we sort the simplices in $T$ in decreasing order of size. The primary
purpose of Disjoint-Cover is to cover $Y$ as completely as possible with a collection
of the largest, disjoint simplices from $T$. While it creates this collection, an additional data
structure, $P$, is constructed and returned along with $C$. Together these objects aid Gap-Cover in
efficiently filling in the gaps in $C$. We use the notation from C++ to denote attributes and methods
in the algorithms.

As mentioned, Disjoint-Cover builds a disjoint collection from the largest non-intersecting
simplices in $T$ (line 7). While building $C$, if it finds a simplex that intersects one of the members
of $C$ (line 11), this simplex is stored in $R$. When a disjoint cover is obtained, the hash
table $P$ is constructed in make-hash. Keyed by the complement sizes of simplices left out
of $C$, $P$ maps these complement sizes to lists of simplices having that size complement. By
sorting $P$ by its keys in decreasing order, this data structure allows Gap-cover to short-circuit
earlier by adding the simplices that most efficiently cover the gaps of points left uncovered
during the construction of $C$.


Algorithm 7 Gap-Cover(C, P, N = |Y|)

Require: P sorted by keys in decreasing order

1: for m ∈ P do
2:     while P[m] ≠ ∅ do
3:         σ ← P[m].next() // remove σ with (possibly) most uncovered vertices
4:         m ← |σ.complement ∩ C|
5:         if m = 0 then [σ has the most uncovered vertices]
6:             C ← C ∪ σ // add to cover
7:             C.vertices ← C.vertices ∪ σ.vertices // mark vertices as covered
8:         else [earlier addition to cover altered complement]
9:             σ.complement ← σ \ C
10:            m' ← |σ.complement|
11:            P[m'] ← P[m'] ∪ σ // update P according to σ's uncovered vertices
12:        end if
13:        if |C.vertices| = N then return C
14:        end if
15:    end while
16: end for


Now consider the Gap-cover algorithm. As noted, $P$ is sorted by keys; let $m$ be the largest
key, or complement size. Assuming that $C$ is not a cover of $Y$, Gap-cover starts by adding
the first simplex, say $\sigma$, in the list of simplices pointed to by $P[m]$ (line 6). Thus, the most
possible new points are added to $C$. Once $\sigma$ is added, it is possible that the complement sizes
of some of the remaining simplices in $P$ have changed. We determine this in a “lazy” way by
waiting until a simplex $\tau$ is chosen from $P[m]$ for which $\tau.\mathrm{complement} \cap C$ is non-empty. Then
$m$ (line 4) will be different from zero, triggering Gap-cover to update $\tau$'s position in $P$ (lines
8-11). Note that elements in $P[m]$ are removed from the front of the list and not returned.
If $P[m]$ is empty, $m$ advances to the next largest key (complement size). During this
process, if the number of vertices covered by simplices in $C$ equals the number of data points,
Gap-cover terminates.
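
The two passes described above can be sketched compactly in Python. This is a simplified illustration, not the implementation used for the timings reported here: simplices are assumed to be frozensets of vertex labels, the function names are hypothetical, complement sizes are computed against the finished disjoint collection, and the key is lowered one step at a time instead of jumping to the next stored key.

```python
from collections import defaultdict

def disjoint_cover(simplices):
    """Greedy pass: keep the largest simplices that share no vertex."""
    cover, covered, rejected = [], set(), []
    for s in sorted(simplices, key=len, reverse=True):
        if covered.isdisjoint(s):
            cover.append(s)
            covered |= s
        else:
            rejected.append(s)
    return cover, covered, rejected

def make_hash(rejected, covered):
    """Bucket rejected simplices by their number of uncovered vertices."""
    buckets = defaultdict(list)
    for s in rejected:
        buckets[len(s - covered)].append(s)
    return buckets

def gap_cover(cover, covered, buckets, n_points):
    """Fill gaps, re-bucketing lazily when a stored complement size is stale."""
    m = max(buckets, default=0)
    while m > 0 and len(covered) < n_points:
        if not buckets[m]:
            m -= 1                             # bucket exhausted: try smaller key
            continue
        s = buckets[m].pop()
        uncovered = len(s - covered)
        if uncovered == m:                     # stored size still valid
            cover.append(s)
            covered |= s
        elif uncovered > 0:                    # stale: move to a smaller bucket
            buckets[uncovered].append(s)
    return cover

def tidy_cover(simplices, n_points):
    cover, covered, rejected = disjoint_cover(simplices)
    return gap_cover(cover, covered, make_hash(rejected, covered), n_points)

# Hypothetical maximal simplices on six points labelled 0..5.
family = [frozenset({0, 1, 2}), frozenset({2, 3, 4}), frozenset({2, 4, 5})]
cover = tidy_cover(family, n_points=6)
```

In this small example one rejected simplex is added first, which makes the stored complement size of the other stale; it is moved to a smaller bucket and picked up on a later iteration.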

An important aspect of the above algorithms is that only one set of simplices is constructed
and used. While a number of data structures are constructed, only pointers are manipulated
as simplices are “moved”. So the memory footprint is of size $O(|T|)$.

By  construction,  the  Tidy-cover  algorithm  returns  a  tidy  cover  of  Y.  We  compared  the 
Tidy-cover  algorithm  to  a  naive  greedy  cover  algorithm  that  updates  complements  without  the 
use  of  the  adjacency  graph.  On  a  multifractal  Cantor  set  with  10,000  points  (see  Section  1.6.1) 
Tidy-cover  returns  a  tidy  cover  approximately  250  times  faster  than  an  algorithm  that  fills  the 
gaps  but  naively  updates  the  complements  without  using  the  hash  P. 

With the tidy cover in hand we have an accurate notion of the density of points in an
attractor $A$ using the sampling of points $Y$. We can now employ the techniques of dimension
theory to approximate the fractal or multifractal properties of $A$.

We  revisit  the  Cantor  set  to  illustrate  the  above  algorithms. 

1.6.1. Example (Cantor set cover). Consider $F$, the Cantor set defined in Section 1.1.1.
Let $p + q = 1$ and wlog assume $p < q$. We construct a measure $\mu$ on $F$. Consider the collection
$E_k$ of $2^k$ intervals of length $3^{-k}$ at each level of the construction of $F$. Then

$$F = \bigcap_{k=0}^{\infty} E_k. \tag{1.24}$$

The first level of the Cantor set, $E_1$, contains left and right subintervals, $E_{10}$ and $E_{11}$,
respectively. For the left interval assign a mass of $p$ and for the right a mass of $q$. Similarly,
$E_2$ is composed of four subintervals: a left and a right subinterval within both of $E_{10}$ and
$E_{11}$. As before, to each of the left intervals assign a mass of $p$, and to each of the right assign
a mass of $q$. Continuing to divide the mass amongst the subintervals in the ratio $p : q$ yields
a mass distribution on $F$ [6]. The resulting set with the associated measure is referred to as a
multifractal Cantor set.




Fig. 1.6. The Cantor set constructed with uneven mass distribution, covered by disjoint simplices (DC) and
a set covering the gaps (GC). The tidy cover from Section 1.6 is the union of DC and GC. The stacks of sets (OL)
above are the simplices that overlap DC and are not used in GC. The disjoint collection DC contains 16 simplices,
the final cover contains 32 simplices, and there are 1645 sets discarded. The difference in sizes in OL is due to the
uneven distribution of points. The blow-up shows in detail the overlap created by the sliding $\epsilon$-window mentioned in
the text.


Given a radius $\epsilon$ we form a VR complex on the distribution of the $N$ points at the $k$th level
of the Cantor set. Intuitively, since the intervals are all the same length, points are more or less
densely packed according to which interval they fall into. For instance, the mass of an interval
at the $k$th level depends on the number of 0's and 1's in the coding described in Section 1.1.1.
An interval for which $i_j = 0$ $n$ times has mass $p^n q^{k-n}$.
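
The mass formula can be checked with a brief Python sketch. It assumes that each level-$k$ interval is labelled by a tuple of $k$ binary digits in which 0 marks a left subinterval (mass factor $p$) and 1 a right subinterval (mass factor $q$); the function name and the value of $p$ are chosen for illustration only.

```python
from itertools import product

def interval_masses(k, p):
    """Mass of each level-k interval, keyed by its 0/1 code (0 = left)."""
    q = 1.0 - p
    return {code: p ** code.count(0) * q ** code.count(1)
            for code in product((0, 1), repeat=k)}

masses = interval_masses(3, p=0.3)
# The interval coded (0, 0, 1) has two left choices and one right choice,
# so its mass is p^2 * q.
example = masses[(0, 0, 1)]
```

Summing the masses over all $2^k$ codes returns 1 up to rounding, since the total mass is only redistributed at each level.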

Since the Cantor set is embedded in the real line, the simplices are composed of points arranged
along intervals within a ball, or window, of size $\epsilon/2$ of one another. In regions of high
density, “sliding” this $\epsilon$-window will create many maximal simplices and a large amount
of overlap in a narrow region. The detail in Figure 1.6 shows the maximal simplices spread
out vertically that result from this phenomenon. One can see how the density in a region
affects the amount of overlap. The number of simplices in $T$ in a region, as well as the
density of points in that region, is correlated with the height of the stacks OL in Figure 1.6. The
original arrangement of simplices can be recovered by projecting all of the simplices in the
figure onto the x-axis (the horizontal level of DC).

In  Figure  1.6,  DC  is  the  set  of  simplices  returned  by  disjoint-cover  and  GC  is  the  set 
returned  by  Gap-cover.  The  tidy  cover  is  the  union  of  the  simplices  in  DC  and  GC.  OL  is  the 
remainder  of  maximal  simplices  from  T  that  were  unused  in  the  final  cover. 

1.7. Results for the Tidy Cover. The box-counting dimension is calculated by defining
$M_\epsilon(Y)$ to be the number of sets of “size” $\epsilon$ necessary to cover $Y$. Computationally this is
often done by “gridding” the space into squares of side length $\epsilon$. Counting the number of
grid elements occupied by at least one point from $Y$ gives $M_\epsilon(Y)$. For example, consider the
ordinary Cantor set described in Section 1.1.1. At the $k$th level there are $2^k$ sets of size $3^{-k}$
so the box-counting technique yields $\alpha = \log(2)/\log(3) \approx 0.631$, the fractal dimension of the
Cantor set.

A numerical approximation of the box-counting dimension of the Cantor set using simplices
in the tidy cover yields a similar result. We calculate a scaling exponent $\alpha \approx 0.656$,
which is within 0.025 of the true dimension. Here $M_\epsilon$ is the number of simplices in the tidy cover
at scale $\epsilon$. The log-log plot of this behavior is shown in Figure 1.7. We considered a sequence
of tidy covers for $\epsilon$ in the range $[3^{-10}, 3^{-3}]$.

Physical and mathematical systems often exhibit regions of varying density in their
attractors [7, 8]. (Mandelbrot's book contains many interesting examples and areas of
real-world study [11].) Let $\mu$ be a measure on such an attractor $A$. Then this situation manifests
itself when the set of points for which $\mu(B_\epsilon(x)) \sim \epsilon^{-\alpha}$ determines a different fractal for a range
of $\alpha$'s. The box-counting dimension ignores the finer structure of a measure and so we must
introduce a set of tools borrowed from statistical physics [5, 6, 17]. We briefly describe how
these work.


Fig. 1.7. Estimation of the box-counting dimension of the Cantor set by the number of simplices contained in
tidy covers at different $\epsilon$'s.

We introduce the partition function for $q \in \mathbb{R}$ and $\epsilon > 0$:

$$Z(q, \epsilon) = \sum_{B \in \mathcal{B}} \mu(B)^q, \tag{1.25}$$

where $\mathcal{B}$ is the set of balls $B$ of radius $\epsilon$ with $\mu(B) > 0$. In thermodynamics, $q$ represents
inverse temperature and $\mu$ measures a “state” of the system [14, 17]. We identify the power
law behavior of $Z$ for each $q$ by [6]

$$\beta(q) = \lim_{\epsilon \to 0} \frac{\log Z(q, \epsilon)}{-\log(\epsilon)}. \tag{1.26}$$

(b) Power law behavior of Z

Fig. 1.8. In (a) each curve represents $Z(q, \epsilon)$ as a function of $\epsilon$ with a fixed value of $q$. The scaling behavior of
the Cantor set is clearly visible in the curves. The $\beta(q)$ curve in (b) shows the scaling behavior of each $Z(q, \epsilon)$ curve
across values of $q$.


We apply the partition function and Equation (1.26) to the ordinary Cantor set, $C$. In this
case there is no change to the mass distribution on $C$, and therefore we expect a single scaling
exponent. This forces the function $\beta(q)$ to be linear. For a measure we use the simplicial
measure $\nu$ on each simplex in the tidy cover at scale $\epsilon$. We use the same sequence of tidy
covers as in the estimation of the box-counting dimension. The scaling structure of $C$ can
be seen in the partition functions in Figure 1.8(a). The behavior of $Z$ over a range of $q$ is
seen to be linear in Figure 1.8(b), indicating a single fractal dimension. Let $f(\alpha)$ denote the
dimension of the set of points for which the power law $\mu(B_\epsilon(x)) \sim \epsilon^{-\alpha}$ holds. It is obtained
from $\beta$ via the Legendre transform [1, 6, 14],

$$f(\alpha) = \sup_{q \in \mathbb{R}} \{\alpha q - \beta(q)\}. \tag{1.27}$$

Since $\beta$ in Figure 1.8(b) is linear, the Legendre transform of $\beta$ is a single point located at
$(\alpha, f(\alpha))$. Note that $f(\alpha)$ is just the negative of the y-intercept of the line tangent to $\beta$ with
slope $\alpha$. And so the dimension of the set for which $\mu(B_\epsilon(x)) \sim \epsilon^{-\alpha}$ holds is $f(\alpha)$. For $C$,
we estimate from the functions shown in Figure 1.8 that the dimension is $f(\alpha) = f(0.633) =
0.626$, which is close to the exact value of $\log(2)/\log(3)$.
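
The linearity of $\beta(q)$ for the ordinary Cantor set can be checked directly from (1.25) and (1.26) in an idealized setting. The Python sketch below is an illustration only: instead of simplices from a tidy cover it uses the $2^k$ level-$k$ intervals of length $3^{-k}$, each carrying equal mass $2^{-k}$, and the function names are hypothetical.

```python
from math import log

def partition_function(masses, q):
    """Z(q, eps) of (1.25): sum of mu(B)^q over sets with positive mass."""
    return sum(m ** q for m in masses if m > 0)

def beta(q, k):
    """Finite-scale version of (1.26) at eps = 3^-k for the uniform Cantor measure."""
    eps = 3.0 ** -k
    masses = [2.0 ** -k] * (2 ** k)
    return log(partition_function(masses, q)) / -log(eps)

# For equal masses, Z = 2^(k(1-q)), so beta(q) = (1 - q) * log(2)/log(3).
values = {q: beta(q, k=8) for q in (0, 1, 2)}
```

At $q = 0$ the partition function simply counts the covering sets, so $\beta(0)$ coincides with the box-counting exponent $\log(2)/\log(3)$.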

If $\beta$ is not linear, an entire $f(\alpha)$ spectrum is derived via the Legendre transform of $\beta$. This
yields the various fractal dimensions of the sets with scaling exponent $\alpha$. We have studied
this situation using both the multifractal Cantor set discussed in Section 1.6.1 as well as the
2-dimensional Henon attractor. Both of these dynamical systems have been shown in other
work to have rich multifractal structures leading to a spectrum of $f(\alpha)$ dimensions. Our
results are preliminary and unfortunately do not show the nonlinear behavior expected of $\beta$.

1.8.  Conclusions.  Attractors  of  relatively  simple  physical  and  dynamical  systems  often 
possess  rich  structure  whose  study  often  sheds  light  on  the  system  itself. 
Recent work in computational topology has led to efficient algorithms for computing
simplicial structures on $Y$. We have introduced the notion of a simplicial measure using simplices
from a VR complex constructed on $Y$. In order to utilize this measure to approximate
the dimension of an attractor from the points in $Y$ we must filter the simplices in the VR
complex.

The  first  step  uses  the  algorithms  implemented  by  Zomorodian  to  create  a  subset  of 
maximal  simplices.  We  then  developed  the  tidy-cover  algorithm  to  deal  with  the  overlap 
problem  in  the  set  of  maximal  simplices.  The  algorithm  creates  a  cover  in  an  efficient  and 
greedy  way,  using  an  adjacency  graph  and  complement  sizes  to  keep  updates  to  a  minimum. 
We  demonstrated  the  way  in  which  this  algorithm  extracts  a  tidy  cover  from  the  VR  complex 
on  a  multifractal  Cantor  set,  yielding  a  “minimal”  set  of  maximal  simplices  that  cover  the  set 
by  the  simplices  of  highest  dimension  possible.  By  construction,  these  simplices  have  very 
little  overlap. 

We  showed  the  usefulness  of  the  tidy  cover  in  computing  dimension.  On  an  ordinary 
Cantor  set  the  tidy  cover  works  very  well  for  computing  the  dimension  using  box-counting 
techniques.  Furthermore,  we  demonstrated  that  techniques  used  on  multifractal  sets,  such  as 
the  partition  function,  behave  well  when  the  simplicial  measure  is  used  with  the  simplices  in 
the  tidy  cover.  In  both  cases  we  were  able  to  approximate  the  fractal  dimension  of  the  Cantor 
set  to  within  0.025  of  the  actual  dimension. 

Our  numerical  methods  have  not  yet  performed  well  on  multifractal  systems,  such  as  the 
multifractal  Cantor  set  discussed  in  Section  1.6.1  or  the  Henon  attractor  [6].  We  will  address 
these  issues  in  future  work. 

2.  Extensions  for  the  TEVA-SPOT  Toolkit.  The  Threat  Ensemble  Vulnerability  As¬ 
sessment  and  Sensor  Placement  Optimization  Tool  (TEVA-SPOT)  is  a  joint  project  of  the  U.S. 
Environmental  Protection  Agency,  Sandia  National  Laboratories,  Argonne  National  Labora¬ 
tory,  and  the  University  of  Cincinnati.  TEVA-SPOT  was  designed  to  model  a  wide  range 
of  sensor  placement  problems  to  optimize  a  real-time,  on-line  contaminant  warning  system 
(CWS)  [2]. 

This  section  details  a  number  of  recently  developed  Python  extensions  to  the  TEVA- 
SPOT  Toolkit. 

2.1. SP Module. The TEVA-SPOT sensor placement solvers are launched using the
sp module. The module takes as input one or more impact files and returns a sensor placement.
The sp module interfaces with various MIP solvers. (See [2] for more details.)

2.2.  Goal  of  the  Extensions.  We  briefly  describe  optimization  using  objectives  defined 
in  TEVA-SPOT.  In  this  work  optimization  is  synonymous  with  minimization  of  an  objective. 
A  Pareto  optimal  set  in  the  solution  space  is  the  set  of  feasible  solutions  such  that  for  each 
solution,  there  exists  no  feasible  solution  that  is  better  than  the  reference  solution  in  at  least 
one  objective  and  is  no  worse  in  the  remaining  objectives  [4]. 

In the preliminary work reported in this paper we consider the set of Pareto optimal
solutions associated with two objectives in a CWS, namely “expected contamination” (EC,
measured in pipe feet) and “mass consumed” (MC, measured in mass of contaminant). If
no bound is put on MC, then optimizing EC yields a mean impact of EC ≈ 8500. But in
this case MC ends up quite high, with a mean impact of MC ≈ 43650. If we
insist on a better value of MC we must constrain it by passing an upper bound to the solver.
In doing so we necessarily drive EC up. A Pareto front or Pareto frontier is a set of discrete
or continuous regions representing such a set of tradeoffs. In general, for d objectives these
regions form (d − 1)-dimensional hypersurfaces in the solution space. The points on the Pareto
front dominate the other solutions. Figure 2.1 shows the Pareto front generated by restricting
MC to values less than 43650. Infeasible solutions result for values of MC less than
approximately 22,000.
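
To make the notion of dominance concrete, the following minimal sketch filters a list of
(EC, MC) pairs down to its non-dominated subset, with both objectives minimized. The function
name pareto_front and every sample value other than the unconstrained optimum (EC ≈ 8500,
MC ≈ 43650) quoted above are illustrative assumptions; this is not TEVA-SPOT code or output.

```python
def pareto_front(points):
    """Return the non-dominated points when every coordinate is minimized."""
    front = []
    for p in points:
        # q dominates p if q is no worse in every objective and strictly
        # better in at least one.
        dominated = any(
            all(q_i <= p_i for q_i, p_i in zip(q, p))
            and any(q_i < p_i for q_i, p_i in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (EC, MC) pairs; only the first pair echoes values from the text.
solutions = [
    (8500, 43650),
    (9000, 40000),
    (9500, 41000),   # dominated by (9000, 40000)
    (12000, 30000),
    (12500, 35000),  # dominated by (12000, 30000)
]
front = pareto_front(solutions)
```

The quadratic scan is adequate for the small solution sets considered here; sorting by one
objective first would be the usual refinement for larger sets.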

The  primary  goal  of  the  extensions  is  to  study  such  Pareto  fronts.  There  are  two  avenues 
that  we  plan  to  follow.  First,  we  wish  to  investigate  the  structure  of  the  near  optimal  solutions 
along  the  Pareto  front.  We  expose  pico’s  enumRelTol  option  in  sp  to  allow  an  optimality 
gap  within  a  given  percentage  of  an  optimal  solution.  For  instance,  we  constrain  MC  and 
require  that  pico  return  all  solutions  within  (say)  10%  of  the  optimal  EC  value  at  the  given 
MC  value.  The  Pareto  front  itself  is  discrete  and  forms  a  relatively  sparse  set  in  the  solution 
space.  The  near-optimal  solutions  form  a  set  with  potentially  interesting  topological  features. 
In  addition,  knowledge  of  near-optimal  solutions  allows  more  flexibility  in  real-world  sensor 
placement  decisions.  See  Figure  2.1. 
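
As a rough illustration of what a relative enumeration tolerance means, the sketch below keeps
every candidate whose EC value lies within a fraction rel_tol of the best EC found for one
fixed MC bound. It is a post-hoc filter written for illustration only: PICO performs its
enumeration inside the solver, and the function name, junction labels, and impact values here
are hypothetical.

```python
def within_relative_gap(solutions, rel_tol):
    """Keep (ec, sensors) pairs whose ec is within rel_tol of the minimum ec.

    Assumes all ec values are positive, as is the case for an impact measure.
    """
    best = min(ec for ec, _ in solutions)
    cutoff = best * (1.0 + rel_tol)
    return [(ec, sensors) for ec, sensors in solutions if ec <= cutoff]


# Hypothetical enumerated placements for a single upper bound on MC.
candidates = [
    (1000.0, ("J-10", "J-35")),
    (1090.0, ("J-10", "J-101")),
    (1250.0, ("J-15", "J-35")),
]
near_optimal = within_relative_gap(candidates, rel_tol=0.10)
```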

Another  route  of  analysis  is  the  graphical  analysis  of  the  water  network  itself.  Statistics 
such  as  the  frequency  of  junction  selection  in  solutions,  distance  between  sensors  (in  linear 
pipe  feet),  and  mean  impact  of  the  sensor  placement  are  relatively  easy  to  study  as  node  and 
edge  properties  in  a  graph.  See  Figure  2.2. 

2.3. SP Extensions for Near-Optimal Analysis. A number of Python modules were
added to the TEVA-SPOT code this summer to facilitate data analysis. They fall into a few
general categories. First, the most generically useful adds parsing and conversion utilities via
the re_parser.py module. Extraction of data for analysis relies on this module. Secondly, the
results can be analyzed in a traditional way by plotting one objective against a constrained
objective. This facility is generalized to allow the user to specify the objectives to be plotted.
Thirdly, the solutions can be considered on the network that produced the scenarios, with
appropriate statistics computed for nodes in the network. All of the extensions in Table 2.1
are new, with the exception of sp2, which is a copy of sp with pico’s enumRelTol option
exposed. We describe their capabilities in more detail below.

1. sp2: Added option parsing ability that exposes the PICO option --enumRelTol and its
associated value. This extension is in the PICO() class and appends the option onto
the PICO call. Since this is the only addition to sp, we will simply reference the sp
module where it appears below.

2. re_parser: This regular expression parser handles sp’s text output. This is quite
useful in general since it creates Python dictionaries from sp output. We summarize
its current abilities:

• Handles regular sp output and enumerated output (only from pico’s raw text
output at the moment).

• Writes solutions to yaml SOLN files. Also loads these yaml files.

• Functions parse_coordinate and parse_pipes handle line input from an INP file.
Respectively, they return the geographic coordinate of each junction ID in the water
network and the pipe connections (i.e., edges in a graph of the water network).

3. constrained_objective: This module interfaces with sp to run optimization problems
with one constrained objective (see Figure 2.1). It contains two classes as well as all
options for input parsing. One handles command processing; the other handles the
subprocess call to sp:

• Class Processor() contains the low level processing for input (creating the
command to pass to sp) and output (handling the text solution returned or written to
disk by sp). It primarily uses re_parser for this.

• Class ConstrainedObjective() inherits from Processor(). It sets up the call to
sp, handles any enumeration parsing that must be done (assuming more than
pico was operational via sp), and handles the returned solution with Processor’s
methods.

Table 2.1
TEVA-SPOT Python Extensions

| Name of extension | Summary | Status |
|---|---|---|
| sp2 | A version of sp exposing PICO’s enumRelTol option. | Working. Calls PICO; sp’s default output does not handle enumeration (handled via re_parser.py). |
| constrained_objective | Formulates a command to pass to sp2 for calling a solver with a constrained objective. | Working. Currently works only with PICO’s enumRelTol option. Uses re_parser.py for solution parsing. |
| re_parser | Parses output from sp and PICO enumerations. Writes results in dictionary format to yaml file. | Working. Should be easy to add new parsing tools by following the model of existing ones. |
| optima_analyzer | Contains tools for plotting Pareto fronts, plotting networks (using Network() class), as well as exporting to Protovis format. | Working. NetworkX graphs, graph analyzers and IO utilities are all stable. |
| Network_protovis | Converts node and edge data created in Network() class and saved in .npy format to VTK format for use with Titan’s Protovis interface [15]. This basically pipes the data to an interactive web browser format. | Alpha. Draws the network and node sizes. There are still issues with Protovis/JavaScript. |

4. optima_analyzer: This module has two main purposes (plus unfinished tools):

• Plot the Pareto front by extracting solutions from SOLN files (Python dicts).

• Class Network() inherits from the NetworkX Graph() class. Nodes are water
network junctions, with node data ‘junctionID’, ‘size’, ‘longitude’, ‘latitude’.
All data is extracted from the appropriate INP file except ‘size’, which is
determined by how often a junction occurs in all enumerated solutions (in SOLN
files). Edges link nodes using the Pipes field in INP files.

• An experimental Network attribute related to nodes and edges is the ‘sensor’
attribute. This attempts to split ‘size’ into a more refined set of quantities
according to how often a node is chosen 1st, 2nd, etc.

• All of the node, edge, and sensor attributes of a Network object can be
exported as a numpy array with custom data type. For instance, nodes →
array((junctionID, longitude, latitude, size), dtype=(int, float, float, float)). These
are more easily read and parsed for conversion to VTK Tables in
Network_protovis.

5. Network_protovis: Note: This is written on top of Titan, which must be installed
to run this module. This module is very basic at the moment. All file names are
hardcoded, i.e., command line interaction has not been included yet.
Network_protovis reads numpy arrays stored on disk (see optima_analyzer),
converts them to VTK Tables, and converts these to an appropriate format (JSON) for
Protovis. VTK includes a wrapper for Protovis, so we embed Javascript code that is
piped to Protovis. The module then calls a browser and renders the network. Output is
similar to the network rendered by optima_analyzer but with more interactive qualities.
See Figure 2.2(b).

6. near_optima_script: This small module loops over a set of upper bounds, running
constrained_objective at each iteration.
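
The ‘size’ node attribute described for optima_analyzer is, in essence, a frequency count
over enumerated solutions. The sketch below shows one way such a count could be computed with
the standard library alone; the function name and the junction IDs are hypothetical, and the
actual module stores the result on a NetworkX graph rather than in a Counter.

```python
from collections import Counter


def junction_frequencies(placements):
    """Count in how many enumerated sensor placements each junction appears."""
    counts = Counter()
    for placement in placements:
        # A set guards against a junction being listed twice in one placement.
        counts.update(set(placement))
    return counts


# Hypothetical enumerated placements of three sensors each.
placements = [
    ("10", "15", "35"),
    ("10", "35", "101"),
    ("10", "15", "101"),
]
sizes = junction_frequencies(placements)
most_common_id, most_common_count = sizes.most_common(1)[0]
```

Each count can then be attached to the corresponding node as its ‘size’ before drawing.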

2.4.  Conclusions.  We  have  developed  extensions  to  the  TEVA-SPOT  software  package 
that  facilitate  data  analysis.  Most  of  the  tools  developed  and  discussed  above  are  fairly  general 
and  can  be  ported  to  other  optimization  problems  with  few  modifications.  Additionally,  the 
network  visualization  tools  provide  novel  ways  to  understand  multi-objective  optimization 
problems. Therefore, we are continuing to develop the visualization aspects of these modules.

Fig. 2.1. (a) Pareto front: the Pareto front for objectives EC and MC, where a sequence of upper bounds
has been imposed on MC (side-constraints). The true optimum for minimization of EC for the given scenario
(5 sensors and given impact files) is the point in the lower right. (b) Enumerated near-optimal solutions:
enumeration of solutions within 10% of Pareto optimal for the same sequence of upper bounds.

Fig. 2.2. Network junctions located geographically; size corresponds to frequency of occurrence in optimal
and near-optimal solutions. (a) NetworkX-generated network: static image of the Net3 sample network. Node size
corresponds to occurrence of nodes in near-optimal solutions. (b) Protovis-generated network: Titan/Protovis-
generated graph. Node size is the same as in (a). Each node carries “mouseover” information. Many types of
interactive statistics can be added using the Protovis package. We interface to it using Titan’s Python library.

BENDERS DECOMPOSITION IN PYOMO

PATRICK  STEELE*  AND  JEAN-PAUL  WATSON1 

Abstract. Large scale linear programs often have computationally intractable numbers of variables.
Decomposition methods such as Benders decomposition can reduce this number, but require very specific
problem structures not always present in the original formulation. We present a tool that removes the need to
explicitly construct a problem with such structure through the use of the algebraic modeling language Pyomo,
a component of the Coopr project.


1.  Introduction. 

1.1. Mathematical Programming. A mathematical programming or mathematical
optimization problem is the problem of finding an element $x$ in some set $S$ that minimizes an
objective function $f : S \to \mathbb{R}$. A specific class of such problems is the class of convex
optimization problems, or convex programs, which can be expressed in the form

\[
\begin{aligned}
\min\ & f(x) \\
\text{s.t.}\ & g_i(x) \le b_i, \quad i = 1, \ldots, m,
\end{aligned}
\tag{1.1}
\]

where $f, g_i : \mathbb{R}^n \to \mathbb{R}$, $i = 1, \ldots, m$, are convex functions. Convex programs are a notable
class of problems because convexity often makes finding a solution more computationally
tractable than in a nonconvex program [5]. A class of convex programs that is particularly
well studied is linear programs, in which all functions are not only convex, but linear. Linear
programs are most often expressed in matrix notation. If the variables in the vector $x \in \mathbb{R}^n$ are
decision variables, i.e., the variables manipulated to minimize the objective function, then a
general linear program has the form

\[
\begin{aligned}
\min\ & c^T x \\
\text{s.t.}\ & Ax = b \\
& x \ge 0,
\end{aligned}
\tag{1.2}
\]

where the matrix $A \in \mathbb{R}^{m \times n}$ and the vector $b \in \mathbb{R}^m$ form $m$ constraints on $x$, and the elements
$c_i$ of the vector $c \in \mathbb{R}^n$ are the cost coefficients of each variable $x_i$. We say a vector $x \in \mathbb{R}^n$ is
feasible to a problem if $x$ satisfies all the constraints in the problem.
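
As a small illustration of the definitions around (1.2), the sketch below checks whether a
candidate vector is feasible and evaluates its cost. The helper names, the tolerance, and the
one-constraint example data are assumptions chosen for illustration; they do not come from
the paper.

```python
def is_feasible(A, b, x, tol=1e-9):
    """True if A x = b and x >= 0 hold up to the tolerance tol."""
    rows_ok = all(
        abs(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i) <= tol
        for row, b_i in zip(A, b)
    )
    return rows_ok and all(x_j >= -tol for x_j in x)


def cost(c, x):
    """Objective value c^T x."""
    return sum(c_j * x_j for c_j, x_j in zip(c, x))


# Hypothetical data: minimize 2*x1 + 3*x2 subject to x1 + x2 + x3 = 4, x >= 0.
A = [[1.0, 1.0, 1.0]]
b = [4.0]
c = [2.0, 3.0, 0.0]

feasible_point = [1.0, 1.0, 2.0]
negative_point = [5.0, -1.0, 0.0]   # satisfies A x = b but violates x >= 0
```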

Although  no  closed-form  solution  to  a  general  linear  program  is  known  to  exist,  several 
algorithms  have  been  developed  to  solve  linear  programs  in  acceptable  amounts  of  time.  One 
commonly  used  algorithm  is  the  simplex  method,  developed  by  George  Dantzig,  which  at 
worst  requires  exponential  time  to  solve  but  in  practice  is  quite  efficient  [11].  Many  interior 
point  methods,  such  as  that  developed  by  Narendra  Karmarkar,  have  polynomial  runtime 
complexity  in  the  worst  case  [10].  Linear  programming  techniques  are  used  to  solve  problems 
ranging from portfolio management [12] to energy planning and forecasting problems [7] to
shortest path and network flow problems [6, 2].

1.2. Algebraic Modeling Languages. Linear programming problems often have a very
compact linear algebra formulation, such as (1.2). Other times, the form of each constraint in
the problem varies enough that forming the correct matrix and vectors to represent the
problem as in (1.2) would be quite tedious. In either case, explicitly writing out all the constraints
in a problem is often unnecessary for human understanding, since many constraints have the
same form, differing only by coefficient values. In fact, many problems solved today
involve millions of variables and constraints, making a hand-constructed problem impossible.
To overcome this, linear programming problems are often expressed via algebraic modeling
languages.

Algebraic  modeling  languages  (AMLs)  are  computer  languages  designed  to  express 
mathematical  ideas  in  a  form  that  can  be  understood  by  a  human.  Although  applicable  to 
a wide range of mathematics, many AMLs are specifically designed to express mathematical
programming problems. Both commercial AMLs, such as AMPL (A Mathematical Programming
Language), GAMS (the General Algebraic Modeling System), or OptimJ, as well as open
source AMLs, such as FLOPC++ or MathProg, are commonly used to express linear
programming models in a form that can be manipulated by a solver. Some AMLs such as AMPL,
GAMS,  and  MathProg  are  stand-alone  pieces  of  software  with  their  own  languages,  while 
others  are  embedded  in  general  programming  languages,  like  OptimJ  (Java)  and  FLOPC++ 
(C++). 

Another embedded AML is Pyomo (Python Optimization Modeling Objects), a component
of the Coopr (Common Optimization Python Repository) project [9, 1]. Pyomo is based
in Python, an interpreted, object-oriented programming language with a well-established
user- and code-base. In addition, Pyomo works cooperatively with other components of the
Coopr project, such as PySP, a stochastic programming toolkit, and CooprOpt, a Python
interface to many optimizers [8].

2.  Benders  Decomposition. 

2.1. Motivation. Let us consider the problem of production planning. A company
manufactures a number of products which they sell to their customers. They have already taken
orders  for  the  current  year,  and  must  fulfill  these  orders  before  the  same  time  next  year,  at 
which  time  they  will  receive  orders  for  the  second  year.  The  company  must  purchase  raw 
materials  to  produce  their  products;  the  price  of  raw  materials  is  expected  to  increase  next 
year,  so  the  company  can  decide  to  purchase  excess  raw  materials  and  store  them  for  a  fee. 

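Formally, the decision problem can be stated as

\[
\begin{aligned}
\min\ & \sum_{i \in P} \bigl( f_i(x_i) + f_i(w_i) \bigr)
        + \sum_{k \in R} \bigl( g_{k1}(y_k) + g_{k2}(z_k) \bigr)
        + \sum_{k \in R} h_k(s_k) \\
\text{s.t.}\ & x_i \ge D_{i1}, && \forall i \in P \\
& w_i \ge D_{i2}, && \forall i \in P \\
& \sum_{i \in P} r_k(x_i) = y_k - s_k, && \forall k \in R \\
& \sum_{i \in P} r_k(w_i) \le z_k + s_k, && \forall k \in R \\
& x_i \ge 0, && \forall i \in P \\
& w_i \ge 0, && \forall i \in P \\
& y_k \ge 0, && \forall k \in R \\
& z_k \ge 0, && \forall k \in R \\
& s_k \ge 0, && \forall k \in R,
\end{aligned}
\tag{P1}
\]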

where $P$ is the set of products the company manufactures and $R$ is the set of raw materials
used to produce the products; $x_i$ and $w_i$ are the decision variables representing the amount
of product $i$ to produce in year one and two, respectively; $y_k$ and $z_k$ are the decision variables
representing the amount of raw material $k$ to buy in year one and two, respectively; and $s_k$ is
the amount of raw material $k$ to store from year one to year two; $f_i$ is the net cost associated
with producing and selling product $i$, $g_{kj}$ is the cost of purchasing raw material $k$ in year
$j$, $r_k$ is the amount of raw material $k$ needed to produce a product, and $h_k$ is the cost of
storing some amount of raw material $k$ from year one to year two.¹ Additionally, $D_{ij}$ is the
demand for product $i$ in year $j$. The objective is to minimize the total cost of production,
purchasing, and storage. The demand constraints ensure that demand is met for each product
each year, and the two material constraints ensure that there are sufficient supplies to
meet production requirements in the first and second year, respectively. Note that the year
one material constraint is a strict equality, since the company stores all unused materials,
whereas the second year is an inequality, since too much material could have been stored in
the first year.

An important omission from this model is the uncertainty of the year two demands; recall
the demand constraints from (P1), written compactly as

\[
x_{ij} \ge D_{ij}, \quad \forall (i, j) \in P \times \{1, 2\},
\]

with $x_{i1} = x_i$ and $x_{i2} = w_i$. This constraint treats both year one and year two demands as
known parameters; however, only year one demands are known with certainty. The values of the
year two demands $D_{i2}$, $i \in P$, lie in some uncertainty set $\mathcal{D}_i$. To model this uncertainty, we
assume that customer demand for product $i$ will be one of $D_{i2}^1, \ldots, D_{i2}^M$, which occur with
probability $p_1, \ldots, p_M$, respectively. This results in $M$ different scenarios that must be considered.
Allow $c_m$ to be the cost of purchasing materials given that scenario $m$ has occurred, with the
convention that an infeasible scenario has infinite cost; then we can consider minimizing our
expected cost with the objective

\[
\min\ \sum_{i \in P} f_i(x_i) + \sum_{k \in R} g_{k1}(y_k) + \sum_{k \in R} h_k(s_k) + \sum_{m=1}^{M} p_m c_m.
\tag{2.1}
\]

Of course, this objective will obtain a value of positive infinity if at least one scenario is
infeasible. To enforce feasibility constraints on each of the scenarios, however, we must
replace the random variables $D_{i2}$, $i \in P$, in (P1) with sets of variables representing the planned
course of action for each scenario, as well as constraints ensuring feasibility in each scenario.
This results in a new formulation of (P1):


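\[
\begin{aligned}
\min\ & f^T x + g^T y + h^T s + p_1 C_1 + \cdots + p_M C_M \\
\text{s.t.}\ & x \ge d \\
& y - s = r^T x \\
& s + z^1 - r^T w^1 \ge 0 \\
& \qquad \vdots \\
& s + z^M - r^T w^M \ge 0,
\end{aligned}
\tag{2.2}
\]

where

\[
C_m = \sum_{i \in P} f_i(w_i^m) + \sum_{k \in R} g_{k2}(z_k^m).
\]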

Notice that with this formulation, each scenario is only loosely coupled to the others via
the year 1 variables; for this reason, we refer to year 1 variables as complicating variables.
Without them, each scenario could be solved independently. This is a common situation in
stochastic programming, or programming under uncertainty, since random variables are often
approximated as a number of discrete scenarios. PySP, a component of the Coopr library,
offers a toolkit designed to solve such problems. Because each scenario introduces another
set of variables, the number of variables in (2.2) often makes the problem computationally
intractable. To overcome this, we utilize Benders decomposition [3].

¹ The use of function notation here is for clarity only; all functions are assumed to be linear, and soon expressions
such as $\sum_i f_i(x_i)$ will be replaced with the dot product $f^T x$.

2.2. Formulation. Benders decomposition is a method designed to reduce the number
of variables in loosely-coupled problems such as (2.2). The primary goal of Benders
decomposition is to create a new master problem containing the constraints and objectives
involving the complicating variables, replacing each set of scenario variables with a single
auxiliary variable. Constraints are then added to the master problem to ensure that the
auxiliary variables ultimately attain the value of the original objectives. Consider the general
linear program


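\[
\begin{aligned}
\min\ & c^T x + f^T y_1 + \cdots + f^T y_M \\
\text{s.t.}\ & A x = b \\
& B_1 x + D y_1 = d_1 \\
& \qquad \vdots \\
& B_M x + D y_M = d_M \\
& x \ge 0, \quad y_m \ge 0, \quad m = 1, \ldots, M.
\end{aligned}
\tag{2.3}
\]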

As in our motivating example, $x$ is a vector of complicating variables that loosely couple
the various sets of variables $y_m$ and their associated cost components and constraints. In general,
the $x$ variables may be either continuous or discrete. To solve this problem via Benders
decomposition, a relaxed master problem is created,

\[
\begin{aligned}
\min\ & c^T x + \mathbf{1}^T z \\
\text{s.t.}\ & A x = b,
\end{aligned}
\tag{2.4}
\]

where $z \in \mathbb{R}^M$, along with the subproblems

\[
\begin{aligned}
\min\ & f^T y_m \\
\text{s.t.}\ & D y_m = d_m - B_m x^* \\
& y_m \ge 0,
\end{aligned}
\tag{2.5}
\]

where $m = 1, \ldots, M$. Note that in the master problem the $z$ variables are unconstrained;
this is not a problem, because our constraint generating techniques will generate the relevant
bounding constraints.
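
The interplay between the relaxed master problem (2.4) and the subproblems (2.5) can be
sketched on a deliberately tiny instance: the master proposes $x^*$, each subproblem is solved
with $x^*$ fixed, and a constraint on $z_m$ is added whenever the subproblem's dual-optimal cost
exceeds the current $z_m$. The sketch below is not the Pyomo implementation and makes several
simplifying assumptions, all of them ours: $x$ is a scalar with bounds $0 \le x \le x_{max}$ in place
of $Ax = b$; each subproblem is $\min\ p_m f\, y$ s.t. $y \ge d_m - x$, $y \ge 0$, which is always
feasible, so no dual-unbounded case arises; $z_m \ge 0$ is imposed from the start; and the
one-dimensional master is solved exactly by checking its breakpoints instead of calling a
linear programming solver.

```python
def benders_single_resource(c, f, demands, probs, x_max, tol=1e-9, max_iter=50):
    """Benders loop for min c*x + sum_m probs[m]*f*max(demands[m] - x, 0)."""
    M = len(demands)
    cuts = [[] for _ in range(M)]  # cuts[m]: multipliers pi of cuts z_m >= pi*(d_m - x)
    candidates = sorted({0.0, float(x_max), *map(float, demands)})
    candidates = [x for x in candidates if 0.0 <= x <= x_max]

    def z_value(m, x):
        # Smallest z_m permitted by the cuts collected so far (and by z_m >= 0).
        return max([0.0] + [pi * (demands[m] - x) for pi in cuts[m]])

    def solve_master():
        # The master objective is piecewise linear and convex in x, with kinks
        # only at the demand values, so a minimizer lies among the candidates.
        x_best = min(
            candidates,
            key=lambda x: c * x + sum(z_value(m, x) for m in range(M)),
        )
        return x_best, [z_value(m, x_best) for m in range(M)]

    for iteration in range(1, max_iter + 1):
        x_star, z_star = solve_master()
        cut_added = False
        for m in range(M):
            shortfall = demands[m] - x_star
            # Dual of subproblem m: max pi*shortfall s.t. 0 <= pi <= probs[m]*f.
            pi = probs[m] * f if shortfall > 0 else 0.0
            if pi * shortfall > z_star[m] + tol:
                cuts[m].append(pi)  # constraint z_m >= pi * (d_m - x)
                cut_added = True
        if not cut_added:
            return x_star, c * x_star + sum(z_star), iteration
    raise RuntimeError("no convergence within max_iter iterations")


# Hypothetical data: three demand scenarios for a single first-stage quantity.
x_opt, total_cost, iterations = benders_single_resource(
    c=1.0, f=3.0, demands=[10, 20, 30], probs=[0.5, 0.3, 0.2], x_max=40
)
```

On this data the first master solve returns $x^* = 0$ because no constraints on $z$ exist yet;
one constraint per scenario is then added, and the second master solve is already optimal.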

At each iteration of the algorithm, the relaxed master problem is solved, and the optimal
value $x^*$ of variable $x$ is recorded. The optimal variable values of $x$ are then fixed in each
subproblem, and the subproblems are solved. If any subproblem $m$ is dual-unbounded² in
direction³ $\nu^{(m)}$, the constraint

\[
(\nu^{(m)})^T (d_m - B_m x) \le 0
\]

is added to the relaxed master problem, constraining the master problem to choose stage 1
variables that do not allow that direction of unboundedness to be feasible. If any subproblem
$m$ has a dual-optimal solution $p^{(m)}$ with an associated cost greater than the value of $z_m$, the
constraint

\[
(p^{(m)})^T (d_m - B_m x) \le z_m
\]

is added to the relaxed master problem, constraining the master problem to choose stage
1 variables that do not allow the optimal cost to be less than the optimal cost of the dual,
enforcing the condition usually imposed by weak duality. If neither type of constraint is
added during an iteration, the current master variable value $x^*$ is the optimal stage-1 variable
value, and the solutions to each subproblem are the optimal actions to be taken should that
scenario occur.

² Every linear programming problem has a ‘dual,’ which is another linear programming problem closely related
to the original, or ‘primal,’ problem. Briefly, each constraint in the primal corresponds to a variable in the dual, and
each variable in the primal corresponds to a constraint in the dual. One can think of the dual variables as penalties
applied to the primal for violating constraints. The most notable traits of a linear programming primal-dual pair are
that the cost associated with any feasible dual solution acts as a lower bound on the optimal cost of the primal, and
that the cost associated with any feasible primal solution acts as an upper bound on the optimal cost of the dual; this
is known as weak duality. In fact, it can be shown that if either the primal or dual has an optimal solution, then both
problems have an optimal solution and the associated costs are the same; this is known as strong duality. For a more
complete treatment of duality theory, see [4].

³ Given a problem with variables $x \in \mathbb{R}^n$, a feasible direction is a vector $d \in \mathbb{R}^n$ such that $x + \theta d$ satisfies the
constraints of the problem for some positive $\theta$ and feasible $x$. An unbounded direction is a direction $d$ such that
$x + \theta d$ is feasible for all positive $\theta$. A direction of improvement is a direction $d$ such that given a cost vector $c \in \mathbb{R}^n$,
$c^T(x + \theta d) < c^T x$ for some positive $\theta$. Thus, an unbounded direction of improvement is a direction that can be chosen
to produce an optimal cost that is unbounded below.

2.3. Representation in Pyomo. To solve a problem via Benders decomposition,
knowledge of the duals of the subproblems is required. Specifically, we need to be able to
determine both the values of the dual-optimal variables for each subproblem and any directions of
dual unboundedness. Although some full-featured solvers, such as the commercial CPLEX
solver, are capable of providing such information, the Pyomo modeling language is designed
to be solver-independent, and so it cannot rely on this peripheral information to be available.
Rather, Pyomo must form the duals of the subproblems and then determine the directions of
unboundedness through the solution to an ordinary linear program.

The first step Pyomo must take is to form the duals of each subproblem. It is well known
that the primal-dual pair related to problem (1.2) is given by

\[
\begin{array}{ll@{\qquad\qquad}ll}
\min & c^T x & \max & p^T b \\
\text{s.t.} & Ax = b & \text{s.t.} & p^T A \le c^T. \\
 & x \ge 0 & &
\end{array}
\tag{2.6}
\]

Note that the vector $p$ contains a decision variable for each constraint in the matrix $A$,
characteristic of the dual. Thus, if Pyomo can express a problem in the form of (1.2), it can
immediately transform the problem to its dual. Fortunately, transforming a general linear
programming problem into the form of (1.2) is straightforward [4]. To accomplish this, the
following transformations are made:

1. For each variable $y$ not explicitly constrained to be nonnegative, replace $y$ with
$y^+ - y^-$, where $y^+, y^- \ge 0$.

2. For each less-than-or-equal-to constraint $A_i x \le b_i$, replace the constraint with
$A_i x + e_i = b_i$, where the excess variable satisfies $e_i \ge 0$.

3. For each greater-than-or-equal-to constraint $A_i x \ge b_i$, replace the constraint with
$A_i x - s_i = b_i$, where the slack variable satisfies $s_i \ge 0$.

4. If the objective is of the form $\max c^T x$, replace it with $\min -c^T x$.

The next step is to determine the directions of unboundedness in the dual subproblems.
If a dual subproblem is solved and found to be unbounded, we solve the auxiliary
subproblem (AUX) described in Proposition 2.1 to determine the direction of unboundedness.
Proposition 2.1. Allow a linear programming problem to be given with a dual of the form

\[
\begin{aligned}
\max\ & p^T b \\
\text{s.t.}\ & p^T A \le c^T.
\end{aligned}
\tag{D}
\]

If (D) is unbounded, a direction of unboundedness $d^*$ is given by an optimal solution to the
auxiliary problem

\[
\begin{aligned}
\min\ & d^T 0 \\
\text{s.t.}\ & d^T A \le 0^T \\
& d^T b \ge 1.
\end{aligned}
\tag{AUX}
\]

Thus, by solving these auxiliary subproblems when directions of dual unboundedness are
required, Pyomo has all the tools necessary to solve a problem via Benders decomposition
using the simplest of linear programming solvers.
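
The reduction to the form (1.2) and the primal-dual pair (2.6) used in this section can be
sketched with plain Python lists. This is an illustrative reconstruction under our own
assumptions, not the Pyomo transformation code: the function names, the (coefficients, sense,
right-hand side) row format, and the example problem are all hypothetical.

```python
def to_standard_form(c, rows, free_vars=(), maximize=False):
    """Rewrite an LP as min cost^T x s.t. A x = b, x >= 0.

    rows is a list of (coefficients, sense, rhs) with sense in {"<=", ">=", "="}.
    free_vars lists the indices of variables not constrained to be nonnegative.
    """
    # A max objective becomes the min of its negation.
    cost = [-v for v in c] if maximize else list(c)
    A = [list(coeffs) for coeffs, _, _ in rows]
    b = [rhs for _, _, rhs in rows]

    # A free variable y becomes y+ - y-: column j is kept as y+, and a new
    # column holding the negated coefficients is appended for y-.
    for j in free_vars:
        cost.append(-cost[j])
        for row in A:
            row.append(-row[j])

    # Each inequality row gets one extra nonnegative variable with zero cost:
    # added (+1) for "<=", subtracted (-1) for ">=".
    for i, (_, sense, _) in enumerate(rows):
        if sense == "=":
            continue
        sign = 1.0 if sense == "<=" else -1.0
        cost.append(0.0)
        for k, row in enumerate(A):
            row.append(sign if k == i else 0.0)
    return cost, A, b


def dual_of_standard_form(cost, A, b):
    """Data of the dual max p^T b s.t. A^T p <= cost, with p unrestricted in sign."""
    A_transposed = [list(column) for column in zip(*A)]
    return b, A_transposed, cost


# Hypothetical problem: max 3*x1 + 2*x2 s.t. x1 + x2 <= 4, x1 - x2 >= -2, x2 free.
std_cost, std_A, std_b = to_standard_form(
    c=[3, 2],
    rows=[([1, 1], "<=", 4), ([1, -1], ">=", -2)],
    free_vars=[1],
    maximize=True,
)
dual_objective, dual_rows, dual_rhs = dual_of_standard_form(std_cost, std_A, std_b)
```

The dual has one variable per row of the standard-form constraint matrix and one constraint
per column, which is why the dual constraint matrix is simply the transpose.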

3.  Conclusions  and  Future  Work.  At  present,  the  transformations  required  to  solve 
a  problem  via  Benders  decomposition  in  Pyomo  are  memory  intensive.  As  Pyomo  is  built 
on  an  interpreted  language,  Pyomo  already  consumes  a  large  amount  of  computer  memory 
and  the  addition  of  these  transformations  will  likely  cause  Pyomo  to  be  unable  to  solve  large 
problems. 

There are several immediate enhancements that can be made to conserve memory
usage in the algorithm outlined above. The motivation for the transformations presented was
to  overcome  the  disparity  in  the  abilities  between  the  various  solvers  supported  by  Pyomo. 
By  making  these  transformations,  Pyomo  is  able  to  function  with  the  least-capable  solver 
available.  However,  these  transformations  become  unnecessary  when  attempting  to  solve  a 
problem  with  a  solver  capable  of  returning  both  the  dual  optimal  solutions  and  directions  of 
dual  unboundedness.  Pyomo  can  take  advantage  of  this,  performing  transformations  only 
when  the  solver  is  incapable  of  performing  them  itself. 

Additional enhancements can be made to improve the run-time performance of this
technique. Note that only constraints are added to the master problem (2.4); this of course
corresponds to adding variables to the dual, and so a dual-optimal solution to (2.4) is guaranteed
to be a dual-feasible solution after adding constraints. This allows the solver to be ‘warm
started’ using a previous solution, both eliminating the need for a phase-I simplex solve and
providing a near-optimal starting basis to the solver, potentially reducing the number of
simplex iterations required to solve the problem.

Recall that in Section 2.1 we considered a problem with two stages of decisions, leading
to our formulation of problem (2.2). We can consider, however, the case where the various
scenarios of (2.2) in turn had a similar problem structure; this structure would lend itself to
nested Benders decomposition, wherein there are k ‘stages’ to consider, and each subproblem
is solved via Benders decomposition. Because Pyomo allows entire models to be abstracted
behind Block objects, it is possible to create a Benders decomposition method where each
top-level scenario is solved via Benders decomposition, thus allowing an arbitrary number of
levels to be solved via Benders decomposition simply by having Block objects contain other
Block objects with a structure similar to (2.2).

Appendix

A. Proof of Proposition 2.1. Proof. Suppose that (D) is unbounded, and that $d^*$
produces the optimal value of (AUX). Then $d^{*T}$ is feasible to (AUX), and so $d^{*T}$ satisfies

\[
d^{*T} A \le 0, \qquad d^{*T} b \ge 1.
\]

Recall that for $d^*$ to be an unbounded direction in (D), it must satisfy

\[
(p^T + \theta d^{*T}) A \le c^T \tag{C1}
\]
\[
d^{*T} b > 0 \tag{C2}
\]

for all $\theta \ge 0$ and some feasible $p$.

We first note that since $d^*$ is feasible to (AUX),

\[
d^{*T} b \ge 1,
\]

which trivially implies that $d^*$ satisfies (C2).

We now show that $d^*$ satisfies (C1). Since $p$ is feasible to (D), it must be that

\[
p^T A \le c^T \iff 0 \le c^T - p^T A.
\]

Then, since $d^{*T} A \le 0$,

\[
\begin{aligned}
\theta d^{*T} A &\le c^T - p^T A, \quad \forall \theta \ge 0 \\
p^T A + \theta d^{*T} A &\le c^T \\
(p^T + \theta d^{*T}) A &\le c^T,
\end{aligned}
\]

as required.

We will now show that if (D) is unbounded, (AUX) has at least one feasible solution.
Allow $d^*$ to be given such that $d^*$ is an unbounded direction of improvement in (D). Since
$d^*$ is an unbounded direction of improvement, it must satisfy constraints (C1) and (C2) from
above. Allow $\theta = 0$; then we have that

\[
\begin{aligned}
(p^T + 0 \cdot d^{*T}) A &\le c^T \\
p^T A &\le c^T \\
0 &\le c^T - p^T A,
\end{aligned}
\]

and so all components of $c^T - p^T A$ are nonnegative. Since

\[
(p^T + \theta d^{*T}) A \le c^T \iff \theta d^{*T} A \le c^T - p^T A,
\]

we can immediately see that if some component of $d^{*T} A$, say $(d^{*T} A)_i$, were positive,
then by choosing

\[
\theta = \frac{(c^T - p^T A)_i + 1}{(d^{*T} A)_i}
\]

the previous constraint is violated. Thus, all components of $d^{*T} A$ are nonpositive, and so
$d^{*T}$ satisfies the first constraint of (AUX), namely

\[
d^{*T} A \le 0.
\]

If $d^{*T} b \ge 1$, all constraints have been met in (AUX), and so $d^*$ is feasible to (AUX).
We now show that if $d^{*T} b = v < 1$, a new unbounded direction of improvement $\bar{d}$ can be
constructed such that $\bar{d}^T b \ge 1$. Recall that (C1) is satisfied by $d^*$ for all nonnegative
$\theta$; then $\frac{1}{v} d^*$ also satisfies (C1), since $v$ is constrained to be positive by (C2).
Additionally,

\[
\frac{1}{v} d^{*T} b = \frac{1}{v} v = 1 \ge 1,
\]

and so $\frac{1}{v} d^*$ is a feasible solution to (AUX) and an unbounded direction of
improvement in (D).

Thus, if (D) is an unbounded problem, (AUX) has at least one feasible solution which is
an unbounded direction of improvement in (D). □
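
The scaling step at the end of the proof can be checked numerically. The sketch below is only an illustration: the matrix A, the vector b, and the direction d are hypothetical values chosen so that d satisfies the first constraint of (AUX) but has d^T b = v < 1, and the helper names are not part of any library. It takes (AUX) to consist of the two constraints stated in the proof, d^T A <= 0 and d^T b >= 1, and uses exact fractions to avoid rounding questions.

```python
from fractions import Fraction


def row_times_matrix(d, A):
    """Return the row vector d^T A for a list d and a list-of-rows matrix A."""
    return [sum(d[i] * A[i][j] for i in range(len(d))) for j in range(len(A[0]))]


def dot(d, b):
    """Return the inner product d^T b."""
    return sum(di * bi for di, bi in zip(d, b))


def is_aux_feasible(d, A, b):
    """Check d^T A <= 0 componentwise and d^T b >= 1."""
    return all(x <= 0 for x in row_times_matrix(d, A)) and dot(d, b) >= 1


# Hypothetical data, chosen only to illustrate the scaling step.
A = [[1, -1], [-2, 1]]
b = [1, 1]
d = [Fraction(1, 4), Fraction(1, 4)]  # d^T A = (-1/4, 0), d^T b = 1/2

v = dot(d, b)
d_scaled = [di / v for di in d]  # the direction (1/v) d used in the proof

print(is_aux_feasible(d, A, b), is_aux_feasible(d_scaled, A, b))
```

Dividing by v leaves the sign of every component of d^T A unchanged because v is positive, which is why only the second constraint needs to be rechecked after scaling.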

B. Pyomo Formulation. Section B.1 shows the Pyomo code representing problem 2.2,
excluding the data necessary to instantiate the problem. Note how each scenario in 2.2 is
represented as a discrete Block object, keeping the Pyomo representation of the problem
similar to the original formulation. Section B.2 shows the PySP code necessary to solve the
Pyomo model via Benders decomposition.^4

B.1. Model.

    from coopr.pyomo import *

    #
    # Set up the common data
    #
    master = Model()

    master.PRODUCTS = Set()
    master.MATERIALS = Set()
    master.SCENARIOS = Set()

    #
    # Stage 1 problem
    #
    master.stage1 = Block()

    master.stage1.costs = Param(master.MATERIALS)
    master.stage1.buy = Var(master.MATERIALS)
    master.stage1.produce = Var(master.PRODUCTS)
    master.stage1.store = Var()

    master.stage1.buy_con = Constraint(master.PRODUCTS)
    master.stage1.store_con = Constraint(master.PRODUCTS)
    master.stage1.prod_con = Constraint(master.PRODUCTS)

    #
    # Stage 2 problems
    #
    master.scenarios = Block(master.SCENARIOS)

    master.scenarios.costs = Param(master.MATERIALS,
                                   master.SCENARIOS)

    master.scenarios.buy = Var(master.MATERIALS)
    master.scenarios.produce = Var(master.PRODUCTS)

    master.scenarios.buy_con = Constraint(master.PRODUCTS)
    master.scenarios.prod_con = Constraint(master.PRODUCTS)

4 Please note that as Coopr continues developing, the exact syntax of the Pyomo formulation and PySP solver
code may change. For up-to-date examples of Pyomo and PySP code, please refer to [1].

B.2. PySP Solver Code.

    from coopr.opt.base import SolverFactory
    from pyutilib.misc import import_file
    from coopr.pysp import Benders

    solver = SolverFactory("glpk")

    problem = import_file("benders.py").model
    instance = problem.create("some_data_file.dat")

    # Specify the Block holding the complicating variables
    # in this case, instance.stage1
    solution = Benders(instance, instance.stage1).solve()
    print solution
STOCHASTIC OPTIMIZATION APPLIED TO ENERGY ECONOMY
OPTIMIZATION MODELS

KEVIN HUNTER*, JOSEPH DECAROLIS*, SARAT SREEPATHI*, AND JEAN-PAUL WATSON†

Abstract.  Addressing  future  uncertainties  is  a  key  challenge  with  multi-decadal  optimization  of  energy  and 
emissions  projections.  A  common  approach  creates  a  small  set  of  projections  based  on  different  exogenous  scenario 
assumptions  to  quantify  the  range  of  possible  outcomes.  While  such  scenario  analysis  can  help  expand  thinking 
about  how  the  future  might  unfold,  such  projections  are  of  limited  value  to  policy  makers,  who  must  make  long-lived 
decisions  before  uncertainty  is  resolved.  While  stochastic  optimization  has  been  applied  to  energy  system  models, 
the  computation  requirements  have  limited  analysis  to  relatively  simple  probability  trees.  However,  multi-stage 
stochastic  optimization  enables  a  more  sophisticated  approach  by  embedding  the  probability  of  different  outcomes 
within  model  formulation,  yielding  a  hedging  strategy  that  accounts  for  future  uncertainties. 


1. Introduction. Over the next few decades, concerns over climate change and energy
security will drive fundamental changes in the global supply, transport, and use of energy.
Energy system and integrated assessment models play a vital role in the planning process by
providing insight related to the future impacts of policies and technology deployment. Energy
economy optimization (EEO) models attempt to optimize (planned) energy system development
over a time horizon, often with an objective function that minimizes the total cost or
maximizes social utility of energy supply. Since the early 1990s, EEO models have emerged
as key tools for analysis of energy and climate policy at the regional, national, and
international scale. Given advances in computing and the ever expanding core of mineable energy
and climate data, the scope and complexity of EEO models has expanded tremendously.

The widening application scope is a natural consequence of the power of optimization
modeling in general, but it remains to be seen whether increased model complexity buys a
modeler more objective insight. The common assumption is that more complex models yield
more accurate future energy projections. However, accurately modeling the future is akin to
predicting the future, and little analysis is currently performed on the likelihood of
different outcomes. EEO models should provide robust results in the face of future
uncertainty; they have little value to policy makers otherwise.

2. An Open Source Energy Economy Optimization Model. Personal observation and
a survey of the literature have led us to identify two critical shortcomings regarding existing
EEO models. First, with few exceptions, the models are not open source.^1 Because EEO models
necessarily have long timeframes, expansive system boundaries, and encompass both physical
and social phenomena, the level of descriptive detail that can be provided in model
documentation and in peer-reviewed journals is insufficient to reproduce a specific set of published
results. While there have been rigorous efforts to compare model results,^2 the lack of access
to source code prevents a deeper level of external verification. Peer-reviewed journal articles,
model documentation, and distribution of executable models provide guidance, but without
the ability for researchers to investigate internal model equations and assumptions, insight
into specific model results is cursory, at best. Dubois noted in 2002 that replication and
verification of scientific models can only be accomplished by freely available source code [2].
Making EEO models open source, freely available, and electronically attainable solves both
the external replication and verification problems.


*North Carolina State University, [kmhunte2, jdecarolis, sarat_s]@ncsu.edu
†Sandia National Laboratories, jwatson@sandia.gov

1 The (recent) exceptions include Yale’s DICE model, http://nordhaus.econ.yale.edu/, and the
OSeMOSYS model, http://osmosys.yolasite.com/

2 E.g., the Stanford Energy Modeling Forum, http://emf.stanford.edu/
A second critical shortcoming of current EEO models is their treatment of uncertainty.
Either intentionally, or through project “feature creep,” a major trend in current EEO models
is toward large, complex models that endogenize all possible phenomena. The drawback is
that large and complex models are computationally expensive to iterate, which makes
uncertainty analysis difficult at best to perform. Further, given the multi-decadal time horizons
of EEO models, validation through comparative observation to real world phenomena is not
practical. This suggests that modelers have little ability to assess how much individual model
features contribute to model performance. Consequently, complex models are mainly used to create
highly detailed scenarios, keeping their applicability limited in scope, and often excluding
consideration of insights EEOs may offer for energy system analysis [6].

In response to these issues, we introduce a new modeling framework called Tools for
Energy Model Optimization and Analysis (TEMOA). TEMOA is a technology-explicit, partial
equilibrium EEO model that minimizes the present system-wide cost of energy supply by optimizing
energy capacity installations and commodity flows to end-use demands. Technologies are
divided into three basic categories: resource supply, intermediate transformations, and demand
technologies.

Energy prices and flows are determined endogenously such that end use demands are
matched exactly by energy supply. The intersection of the inverse demand curve and supply
curve represents equilibrium, and this equilibrium is calculated for every commodity in the
energy system. Exogenously specified fixed demands will be used initially in TEMOA and
represent a vertical demand curve. In this case, the equilibria are not responsive to changes in
demand based on commodity prices. Elasticities can be specified in order to make demand
responsive to price, but in either case, the model still represents a partial equilibrium on energy
markets because it does not take into account broader economic effects and the associated
feedbacks outside of the energy system. For a detailed explanation of the economic rationale
behind partial equilibrium models, see Loulou et al. (2004) [5].

2.1. Formulation as a Linear Program. The initial development phase of TEMOA
has focused on creating a simplified version that represents the energy system as a linear
programming (LP) model. LPs are restricted to have only linear objective functions and
constraint equations and represent a well-studied class of optimization problems. For a feasible
set of assumptions (input parameters), LP algorithms are guaranteed to find an optimal
solution, provided the objective is bounded.

The  TEMOA  objective  function  is  provided  below  and  represents  the  present  cost  of 
supplying  energy  over  the  model  time  horizon  (Ctot): 


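\[
\mathrm{Icost}(p,t,v) = \left( C_i(t,v) \cdot \left[ \frac{r_{i,t}}{1 - (1 + r_{i,t})^{-n_{l,t}}} \right] \cdot imat_{t,v,p} + C_f(t,v,p) \right) \cdot Cap_{t,v}
\]

\[
\mathrm{Ucost}(p,t,v) = C_m(t,v,p) \cdot vmat_{t,v,p} \cdot Util_{t,v,p}
\]

\[
C_{tot} = \sum_{p} \sum_{t} \sum_{v} \sum_{y=0}^{t_{p+1} - t_p} \bigl( \mathrm{Icost}(p,t,v) + \mathrm{Ucost}(p,t,v) \bigr) \cdot \left[ \frac{1}{(1 + r_g)^{t_p + y - t_0}} \right]
\]
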

where  the  parameters  are  defined  in  Fig.  2.1. 

There are two driving sets to this equation: model time periods and technologies. There
is also the vintage set, but the model internally creates the vintage set (investment period) as
an upper triangle of periods x periods for each technology; the model uses the first period
axis to represent the period in which a technology was built (the “vintage”). The decision
variables Cap and Util represent the model’s decision to build capacity of a specific
technology in a certain year, and the model’s decision to utilize that capacity. The bracketed
terms are the standard economics multipliers to annualize the technology investment costs $C_i$
and to convert to present-day value. Other parameters include the marginal cost of technology
operation $C_m$, the technology-specific interest rate of the investment $r_{i,t}$ and loan life
$n_{l,t}$, and finally the global discount rate $r_g$. The precalculated imat and vmat multipliers
are sparse, three-dimensional, binary matrices that track the investment period and technology
lifetime by vintage.

    Symbol            Description
    $C_i(t,v)$        Investment cost for a technology and vintage
    $C_f(t,v,p)$      Fixed cost for a technology vintage, in a period
    $C_m(t,v,p)$      Marginal cost of operation of a technology vintage in a period
    $imat_{t,v,p}$    Binary matrix; tracks technology loan investment lifetime by vintage
    $vmat_{t,v,p}$    Binary matrix; tracks technology lifetime by vintage
    $Cap_{t,v}$       Variable; installed technology capacity by vintage
    $Util_{t,v,p}$    Variable; technology vintage utilized by period
    $r_{i,t}$         Technology investment rate
    $r_g$             Global discount rate

Fig. 2.1. Parameters for the TEMOA objective function.
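
The two economics multipliers can be made concrete with a short calculation. The sketch below is illustrative only: the function names, the example rates, and the treatment of a zero interest rate are assumptions and are not taken from TEMOA. It uses the standard capital recovery factor r / (1 - (1 + r)^(-n)) to annualize an investment, and it assumes the discounting sum runs over y = 0, ..., t_{p+1} - t_p with factors 1 / (1 + r_g)^(t_p + y - t_0).

```python
def capital_recovery_factor(rate, loan_life):
    """Annual payment per unit borrowed at `rate` over `loan_life` years."""
    if rate == 0:
        return 1.0 / loan_life  # limit of r / (1 - (1 + r)**-n) as r -> 0
    return rate / (1.0 - (1.0 + rate) ** (-loan_life))


def period_discount_sum(global_rate, period_start, period_end, base_year):
    """Sum of 1 / (1 + r_g)**(t_p + y - t_0) for y = 0 .. t_{p+1} - t_p."""
    years = period_end - period_start
    return sum(
        1.0 / (1.0 + global_rate) ** (period_start + y - base_year)
        for y in range(years + 1)
    )


# Hypothetical numbers: a 30-year loan at 8%, a 5% global discount rate,
# and a model period running from 2010 to 2020 with base year 2000.
annual_payment = 1000.0 * capital_recovery_factor(0.08, 30)
spread = period_discount_sum(0.05, 2010, 2020, 2000)
print(round(annual_payment, 2), round(spread, 4))
```

Because both multipliers depend only on input parameters, they can be computed once before the LP is built, which keeps the objective linear in the decision variables.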

Beyond the basic nonnegativity constraints on the variables, there are three basic
constraint equations:

•  A technology-level constraint set ensures that commodity inputs are at least equal to
commodity outputs for each period.

•  A technology-level constraint set ensures that production cannot exceed the installed
capacity for each period.

•  A set of constraints enforces that all end use demands are met by production from
appropriate demand technologies for each period.

With the variables, objective function, and constraints defined, instantiating an LP
problem from this model merely requires specification of each parameter.

2.2. Motivation for Stochastic Program Formulation. “Merely” is deceptive, however,
as parameter specification is the crux of energy modeling. In the context of EEO models,
LPs implement a “perfect foresight” expectation of a problem instance; to solve an LP,
every input parameter must be fully realized. Unfortunately, many real world problems have
parameters that cannot be specified until they are realized. For example, what will be the
price of fuel, or the efficiency rating of coal power plants in 2030? (Will we even have coal
power plants?) Without perfect foresight, it is impossible to know the route to an optimal
solution.

Stochastic programming (SP) is similar to linear programming, but more formally
addresses inherent model uncertainty. SPs incorporate data into the objective function and
constraints that may be uncertain. Though real-world problems have unknown coefficients
(e.g., the price of natural gas 10 years in the future), these can often be described via probability
distributions. For instance, though we do not yet know the price of natural gas one year in the
future, we know the current price exactly. We also have a conditional probability range for
the price next year, based on a range of mineable data inputs (e.g., historical usage, planned
business ventures, policy incentives). In this vein, we can assign a probability that the price
will change (for example) by ±5%. Though we still do not know how the price will change,
we can at least take action now to limit the cost of a worst-case scenario next year.
One way to formulate a stochastic program is to break it up into two parts: what we
can decide now and what we can do after the consequences of the decision are known. This
method is called a “recourse” model. More generally, recourse models try to minimize the
cost of the non-anticipative stages plus the expected cost of the “recourse” decision [1].
Subproblem 2.2 is the recourse:


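\[
\min_x \; c^T x + E_\xi\, Q(x, \xi) \quad \text{s.t.} \quad Ax \le b \tag{2.1}
\]

\[
Q(x, \xi) = \min_y \; q^T y \quad \text{s.t.} \quad Wy = h - Tx, \quad y \ge 0 \tag{2.2}
\]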

Here, the x vector denotes the non-anticipative variables for which all input parameters
c and all constraint coefficients A and b are known. The subproblem contains the recourse
decision variable y, the not-yet-realized (anticipative) input parameters q, constraint
coefficients h and T, and the coefficient matrix W of the recourse decision; the outcome of the
x decision enters through the right-hand side h - Tx. $E_\xi$ denotes mathematical expectation
with respect to $\xi$.

In  less  formal  math  language,  this  basically  means  that  an  SP  will  optimize  the  variables 
it  can  (the  non-anticipative  stages),  and  try  to  find  a  best  preparatory  step  for  minimizing 
the  future  cost  over  the  range  of  possible  scenarios  (expected  future  cost).  Though  the  SP 
is  “aware”  of  the  future  possibilities,  it  must  make  a  decision  without  knowledge  of  which 
future  scenario  will  actually  occur:  this  is  the  non-anticipatory  property.  All  future  stages 
can  be  resolved  only  when  more  information  about  them  is  realized,  as  by  waiting  a  year. 

Suppose  a  coal  power  plant  is  tasked  with  keeping  a  city’s  lights  on.  The  plant  furnace 
can  acquire  coal  from  either  its  own  storage  facilities  or  by  purchasing  coal  from  a  coal  mine. 
Coal  from  the  mine  costs  a  certain  amount  normally,  but  during  higher  demand  seasons,  it 
might  be  more.  It’s  a  normal  season  this  year,  but  given  past  weather  data,  it  might  also 
be warmer or colder next year. Due to overhead, coal costs $0.04/lb/year to store. Finally,
the  city’s  general  demand  for  electric  lighting  will  increase  during  warmer  or  colder  years. 
(This  information  is  summarized  in  Table  2.1.)  Given  that  next  year’s  average  temperature  is 
unknown,  how  should  the  coal  plant  plan  its  purchases? 


    Scenario   Description   Probability   Coal Cost ($)   Demand (units)
    1          Normal        1/5           0.50            100
    2          Warmer        2/5           0.55            110
    3          Colder        2/5           0.70            140

Table 2.1

Example problem data summary


One method of solution replaces the expectation $E_\xi Q(x, \xi)$ in the initial problem with a
weighted sum of the second-stage decision variables $y_i$ for each possible scenario. This
method is called the extensive form because it exhaustively states all possible scenarios in
the original problem:
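\[
\begin{aligned}
\min \quad & 0.50\,x + 0.04\,(s_1 + s_2) + \tfrac{1}{5} \cdot 0.50\,y_1 + \tfrac{2}{5} \cdot 0.55\,y_2 + \tfrac{2}{5} \cdot 0.70\,y_3 \\
\text{s.t.} \quad & x - s_1 = 100 && \text{(a)} \\
& y_1 + s_1 - s_2 = 100 && \text{(b)} \\
& y_2 + s_1 - s_2 = 110 && \text{(c)} \\
& y_3 + s_1 - s_2 = 140 && \text{(d)} \\
& x, s_1, s_2, y_1, y_2, y_3 \ge 0
\end{aligned}
\]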

where x is the number of coal units to purchase this year, $y_i$ is the number of coal units
to purchase in the following year under scenario i, and $s_1$ and $s_2$ are the amounts of coal
put into storage in years 1 and 2. Constraint (a) ensures that demand is met in the first
year, and that any excess is put into storage. Constraints (b), (c), and (d) do the same for
year two, but also take into account the leftover storage from the first year. (Note that $s_2$ may
“intuitively” be 0 in the optimal solution, but keeping it allows the problem to be extended
more easily to multiple years.) The optimal solution of the extensive form is summarized in
Table 2.2.


    Stage (Scenario)   Bought   Stored   Cost ($)
    1                  200      100      104.00
    2 (Normal)         0        0        0.00
    2 (Warmer)         10       0        5.50
    2 (Colder)         40       0        28.00

Table 2.2

Example solution summary


The optimal expected cost is 104.00 + 1/5 · 0 + 2/5 · (5.50 + 28.00) = $117.40 over the two
years.
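
The example is small enough to check by brute force. The following sketch is not the extensive-form LP itself: as a simplifying assumption it ignores second-year storage (leftover coal in year two is simply unused at no cost) and searches only whole units of first-year storage. The scenario probabilities 1/5, 2/5, 2/5 are those used in the expected-cost calculation; all names are illustrative.

```python
PRICE_NOW = 0.50      # cost per unit of coal bought this year
STORAGE_COST = 0.04   # cost per unit stored for one year
DEMAND_NOW = 100      # first-year demand (normal season)

# scenario name -> (probability, coal cost next year, demand next year)
SCENARIOS = {
    "normal": (0.2, 0.50, 100),
    "warmer": (0.4, 0.55, 110),
    "colder": (0.4, 0.70, 140),
}


def expected_cost(stored):
    """Expected two-year cost when `stored` units are carried into year two."""
    first_stage = PRICE_NOW * (DEMAND_NOW + stored) + STORAGE_COST * stored
    recourse = sum(
        prob * price * max(0, demand - stored)
        for prob, price, demand in SCENARIOS.values()
    )
    return first_stage + recourse


# Enumerate every whole-unit storage decision up to the largest demand.
best_stored = min(range(0, 141), key=expected_cost)
print(best_stored, round(expected_cost(best_stored), 2))
```

Under these assumptions the search selects 100 stored units at an expected cost of 117.40, in agreement with the solution summary. Each stored unit costs 0.54 now; below 100 units it saves an expected 0.60 next year, while beyond 100 units it saves at most 0.50, which is why the optimum sits exactly at the normal-year demand.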

Stochastic programs can also be thought of in terms of scenario trees. Each leaf node of
the tree represents a possible scenario. For a simple 3-stage problem with 1 random parameter
with 3 discrete outcomes, the tree is small at 9 ($3^2$) scenarios (visually depicted in Figure 2.2).
But if there are x random variables, each with y possibilities, to be solved over z stages, that
is $(y^x)^{z-1}$ leaf nodes or scenarios.


Fig.  2.2.  Example  single  random  variable  3-stage  scenario  tree 


For  EEO  models  that  necessarily  observe  long  time  horizons  and  have  between  tens  and 
thousands  of  variable  parameters,  this  represents  a  very  real  obstacle  to  calculating  useful 
results  in  a  timely  fashion. 
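
To give a feel for this growth, the short sketch below counts leaf nodes under the assumption that the first stage is a single root node and that every later stage branches on all combinations of outcomes, so that x random parameters with y outcomes each over z stages give (y^x)^(z-1) scenarios. The function name and the example sizes are illustrative only.

```python
def scenario_count(num_random_params, outcomes_per_param, num_stages):
    """Leaf nodes of a scenario tree: (y**x)**(z - 1)."""
    branches_per_node = outcomes_per_param ** num_random_params
    return branches_per_node ** (num_stages - 1)


# Two random parameters with three outcomes each, for increasing horizons.
for stages in (3, 5, 10):
    print(stages, scenario_count(2, 3, stages))
```

With a single random parameter, three outcomes, and three stages, the function returns the 9 scenarios of the example tree; adding parameters or stages multiplies the count rather than adding to it.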
2.3. Formulation as a Stochastic Program. While each constraint of the LP version
of  TEMOA  ensures  that  the  model  meets  demand  or  production  requirements  at  all  periods, 
the  objective  function  minimizes  the  cost  over  the  entire  model  time  horizon.  As  time  is 
a  natural  stochastic  partition  point  due  to  future  uncertainty,  we  chose  to  remove  perfect 
foresight  from  TEMOA  by  converting  the  singular  objective  function  into  the  sum  of  a  set 
of  “sub-objectives”,  indexed  by  the  model  period.  Each  sub-objective  is  aware  of  its  model 
period,  the  model  decisions  of  the  previous  periods,  and  the  expectation  of  future  outcomes. 
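We implement this simplistically through the use of an additional period-indexed variable
$pc_k$, $k \in P$:

\[
\mathrm{Icost}(k,t,v) = \left( C_i(t,v) \cdot \left[ \frac{r_{i,t}}{1 - (1 + r_{i,t})^{-n_{l,t}}} \right] \cdot imat_{t,v,k} + C_f(t,v,k) \right) \cdot Cap_{t,v}
\]

\[
\mathrm{Ucost}(k,t,v) = C_m(t,v,k) \cdot vmat_{t,v,k} \cdot Util_{t,v,k}
\]

\[
pc_k = \sum_{t} \sum_{v} \sum_{y=0}^{t_{k+1} - t_k} \bigl( \mathrm{Icost}(k,t,v) + \mathrm{Ucost}(k,t,v) \bigr) \cdot \left[ \frac{1}{(1 + r_g)^{t_k + y - t_0}} \right] \qquad \forall k \in P
\]
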


Note that this reformulation removes the outer sum of the LP version for each pc index,
but replaces the entire objective function with this new sum:

\[
C_{tot} = \sum_{k \in P} pc_k
\]

Finally, to make it stochastic, we must specifically describe which stages are anticipatory,
and which parameters are to be treated stochastically (to be decided by the modeler on a
per-run basis). In reference to equations 2.1 and 2.2, the non-anticipatory stages relate to x
and the anticipatory stages relate to y. Strictly speaking, this reformulation is not necessary, but
it splits up the actions of the model into logical parts (for cognitive purposes), and eases the
next step of describing the stochastic problem to the computer modeling system. We give a
brief code snippet of how to do this in the next section.

2.4. Modeling Environment. Algebraic modeling languages (AMLs) are computer
languages used to describe optimization problems. There are a plethora of AMLs available,
but EEOs have historically favored AMPL^3, AIMMS^4, or GAMS^5. The modeling language
choice is a critical decision because it heavily shapes interaction with developers and users
alike. We have chosen to use Python Optimization Modeling Objects (Pyomo), a package
that provides capabilities often associated with more well-known languages such as AMPL,
AIMMS, and GAMS. As the Pyomo application programming interface (API) is written in
the high-level programming language Python, any model written with it gains access to a
number of Python advantages. Among other boons, Python offers cross-platform portability
and, given the popularity of Python in the scientific community, there exist a large number of
libraries for nearly any task [3].


3 A Mathematical Programming Language, http://www.ampl.com/

4 http://www.aimms.com/

5 General Algebraic Modeling System, http://www.gams.com/
While Pyomo affords a powerful programming environment for mathematical optimization
problems, it lacks the compact nature of pure AMLs, which were specifically designed
with a concise syntax that closely mimics mathematical notation. Despite its verbosity,
Pyomo is intuitive and uses the same model structure (i.e., sets, parameters, variables, equations)
as AMLs. The example below is the Pyomo version of the TEMOA objective function from
the SP formulation above:

    def AnnualCost ( per, model ):
        M = model
        cost = 0.0

        for t in M.tech_new:
            for i in M.invest_period:
                if (t, i, per) in M.investment:
                    # If we built it in a recent time period, need to finish
                    # paying off that loan;
                    cost += ( M.period_spread[ per ] *
                        M.xc[t, i]
                        * ( M.investment_costs[t, i, per]
                            * M.loan_cost[ t ]
                            + M.fixed_costs[t, i, per] )
                    )
                else:
                    # otherwise, if it's still operational, we just need to pay
                    # the operating costs (fixed O&M).
                    cost += (
                        M.period_spread[ per ]
                        * M.xc[t, i]
                        * M.fixed_costs[t, i, per]
                    )

        # Finally, how much capacity did we use?  Have to pay for that too.
        cost += sum( [
            M.xu[t, i, per]
            * M.marg_costs[t, i, per]
            * M.period_spread[ per ]
            for i in M.invest_period
            for t in M.tech_all
        ] )

        return cost

A secondary benefit of coding TEMOA against the Pyomo API is the tie-in with a sister
project called PySP. PySP extends Pyomo to support multistage stochastic programs with
enumerated scenarios [4]. The PySP package enables a fairly simple declaration of the
structure of an SP, and automatically handles many of the low-level solving details. Though each
node in a scenario tree requires unique input data (for each stochastic parameter’s probability
distribution), much of the tedium can be outsourced to a generating script.

To understand how to create a stochastic program with PySP, here is a simple 2-stage
scenario tree. For brevity, this snippet only illustrates how to set up the stage variables and
associated model variables.

    param StageCostVariable :=
        p2000 PeriodCost[2000]
        p2010 PeriodCost[2010]
    ;

    set StageVariables[p2000] := xc[*,2000] xu[*,2000,2000] ;

    set StageVariables[p2010] := xc[*,2010] xu[*,2000,2010] xu[*,2010,2010] ;
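
A generating script for declarations of this kind can be very small. The sketch below is a hypothetical helper, not part of PySP or TEMOA: it assumes that each stage is named p followed by the period, that a stage owns the capacity variable xc for its own period, and that it owns the utilization variable xu for every vintage built in that period or earlier, which is the pattern visible in the snippet above.

```python
def stage_variable_lines(periods):
    """Build 'set StageVariables[...]' lines for an ordered list of periods."""
    lines = []
    for index, period in enumerate(periods):
        # Capacity built in this period, plus utilization in this period of
        # every vintage built in this or an earlier period.
        names = ["xc[*,%d]" % period]
        names += ["xu[*,%d,%d]" % (vintage, period)
                  for vintage in periods[: index + 1]]
        lines.append(
            "set StageVariables[p%d] := %s ;" % (period, " ".join(names))
        )
    return lines


for line in stage_variable_lines([2000, 2010]):
    print(line)
```

For the two periods 2000 and 2010 the helper reproduces the two set declarations shown above; extending the list of periods adds one stage per period, with one more xu entry per stage as older vintages accumulate.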


3. Future Work. The general design philosophy of TEMOA is to make the model just
complex enough to answer specific questions, but no more. The goal is to keep the model
lightweight in order to facilitate rigorous uncertainty analysis. We have only very recently
begun working with stochastic programming and PySP, but, with limited computing resources
available to us, we have already identified a challenge with regard to the sheer size of problems.
We have work ahead of us in learning how to “prune” our scenario trees. Relatedly, scenario
tree generation is a candidate for streamlining the modeler’s workflow, in that we currently
modify our stochastic parameters manually. There is work underway to automate and script
scenario tree generation.

Making  more  efficient  use  of  resources  is  fast  becoming  an  issue.  Currently  we  are  only 
developing  in  singleton  runs  and  batch  sets  of  output.  However,  as  we  migrate  to  a  stochastic 
model,  the  computational  power  increase  predicted  by  Moore’s  law  will  soon  not  be  enough 
to  solve  our  model  in  a  reasonable  amount  of  time.  The  problem  of  a  mushrooming  number 
of  scenarios  and  choices  will  quickly  necessitate  a  less  naive  approach  to  a  model  solution. 

In a larger sense, uncertainty analysis is only one component of the project. We are also
working on a general EEO database, which we hope can be a basis and central repository of
interaction for a number of different EEO models. Short of time travel, true validation of EEO
model results in a timely fashion will never be possible. However, interacting with multiple
EEO models through a central hub of information may provide a form of relative validation.

REFERENCES 

[1] J.R. Birge and F. Louveaux. Introduction to Stochastic Programming. Springer Verlag, 1997.

[2] R Dubois. Key Techniques for Open-Source Science. In Coalition for Academic Scientific Computation (CASC)
Seminar, March 2002.

[3] W.E. Hart. Python Optimization Modeling Objects (Pyomo). Operations Research and Cyber-Infrastructure,
pages 3-19, 2009.

[4] D. Woodruff and J. Watson. PySP Version 1.1, User Documentation,
https://software.sandia.gov/trac/coopr/wiki/PySP, 2010.

[5] R. Loulou, G. Goldstein, K. Noble, et al. Documentation for the MARKAL Family of Models. IEA Energy
Technology Systems Analysis Programme. Paris, 2004.

[6] M.G. Morgan, M. Henrion, and M. Small. Uncertainty: a guide to dealing with uncertainty in quantitative risk
and policy analysis. Cambridge University Press, 1990.


CSRI  Summer  Proceedings  2010 




AN  ACRO  IMPLEMENTATION  OF  THE  HYBRID  OPTIMIZATION  ALGORITHM 

EAGLS 

JACOB  W.  ORSINI11  AND  GENETHA  A.  GRAY  ** 

Abstract. EAGLS is a hybrid optimization algorithm that combines the global functionality of genetic algorithms
with the speed of a local direct search. This novel approach to hybridization has shown promise when used on various
benchmarking and real-world applications. Consequently, there is a need to implement the algorithm in a widely
available package, such as Sandia’s DAKOTA, in order to allow public use. The task of implementing EAGLS in
DAKOTA is most easily done by using an existing front end available in DAKOTA. This front end, ACRO, has
built-in libraries and interfaces which allow developers to use pre-built optimizers in order to test their own problems
and create custom hybrid schemes.

1. Introduction. Comprehensive studies of complex systems in science and engineering
benefit from the inclusion of physical experimentation. However, in many cases, such
experiments are prohibitively expensive or even impossible to perform. To overcome such
limitations, scientists and engineers take advantage of the fact that the behavior of many of
these systems can be imitated by computer models and use computer simulations as an alternative
or augmentation to experimental data. Such simulations have created a need for tools to help
assess and improve the predictive capabilities of numerical models. This can be achieved by
pairing the simulations with optimization models.

Many of these simulation-based optimization problems contain both continuous and integer
or categorical variables. In fact, within Sandia’s nuclear weapons program, there is an
abundance of mixed variable optimization problems. For example, a discrete variable may
represent a material type or the number of widgets in a design. Mixed variable models are also
prevalent in the area of critical infrastructure such as the electrical grid, highway system, and
water supply system. Therefore, the research and development of optimization algorithms
that can handle mixed-integer nonlinear programs (MINLPs) has become a focus.

It should also be noted that within the simulation-based framework, the objective function
to be minimized is complicated in the sense that derivatives are unavailable and approximate
derivatives are often unreliable. Moreover, the objective function landscapes can be
disconnected, non-convex, non-smooth, or contain undesirable, multiple local minima due to
the complicated physical phenomena being modeled. Therefore, a variety of derivative-free
optimization methods have emerged and matured over the years to address simulation-based
problems in general. They are theoretically supported and have established convergence criteria.

The evolutionary algorithm guiding local search, EAGLS, has been developed to address
derivative-free MINLPs of the form

$$\begin{aligned}
\underset{x \in \mathbb{R}^{n_r},\; z \in \mathbb{Z}^{n_b}}{\text{minimize}} \quad & f(z, x) \\
\text{subject to} \quad & c(z, x) \le 0 \\
& z_\ell \le z \le z_u \\
& x_\ell \le x \le x_u,
\end{aligned}$$

where $c$ is a vector-valued constraint function. This method has been shown to be successful on
common hydrology problems and shows promise on other standard tests. For this reason, it would be
beneficial to make EAGLS available for wider use by implementing it within Sandia’s Design
Analysis Kit for Optimization and Terascale Applications (DAKOTA). DAKOTA was
chosen since it provides a flexible, extensible interface between analysis codes and iterative
systems analysis methods [1]. This task is most easily accomplished by leveraging A Common
Repository for Optimizers (ACRO), which is a user-friendly front end for solving optimization
problems in science and engineering [5]. Built into this software are robust versions of
common optimization algorithms, which can be executed singly or in parallel by a user. This
functionality allows for the implementation of custom hybrid schemes, such as EAGLS.

The  goal  of  this  project  is  to  implement  the  EAGLS  algorithm  using  optimizers  currently 
built  into  the  ACRO  package.  This  paper  describes  this  process  as  follows.  First,  in  Section 
2,  we  give  an  overview  of  the  EAGLS  algorithm,  focusing  on  the  characteristics  which  are 
most  important  in  its  implementation.  Section  3  summarizes  the  ACRO  package,  and  Section 
4  describes  how  EAGLS  was  implemented  as  an  ACRO  scolib  package.  Finally,  Section  5 
gives  some  general  comments  and  describes  future  work. 

2. EAGLS. Derivative-free optimization via evolutionary algorithms guiding local
search, EAGLS, provides a novel way of creating an asynchronous parallel hybrid optimization
scheme, as seen in [7]. In the past, derivative-free optimization methods have relied on relaxation
techniques to approximate discrete variables, such as integers, using a similar
continuous function. Because an approximation is being used, the answer obtained may not be
the best possible answer for the discrete case. EAGLS combines
an existing heuristic algorithm and a direct search algorithm in order to efficiently search for
local optima. The heuristic algorithm used by EAGLS is the non-dominated sorting genetic
algorithm (NSGA-II) as described in [2]. By using this genetic algorithm (GA), EAGLS is able
to search the entire domain for potentially optimal areas without having to rely on derivatives.
However, using a GA on its own would require large amounts of computational power to
run fully and obtain a set of solution points. To alleviate this, the EAGLS algorithm uses
the asynchronous parallel pattern search (APPS), as described in [3], to refine the
points found by NSGA-II. These two algorithms were chosen for this application as both have
been widely tested and are easily interfaced with other methods. This allows NSGA-II to focus
on handling integer variables and global searches while APPS runs local searches with real
variables.
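
The risk of relaxation mentioned above can be illustrated with a small, purely hypothetical example. The objective `f` below is an invented asymmetric function, not one of the test problems from [7], and a coarse grid search stands in for a continuous solver. Rounding the relaxed minimizer gives a worse integer point than enumerating the integers directly.

```python
def f(z):
    """Hypothetical objective: steep to the left of 2.4, shallow to the right."""
    return 10.0 * (2.4 - z) ** 2 if z < 2.4 else (z - 2.4) ** 2

# Continuous relaxation: treat z as real and minimize over a fine grid.
grid = [i / 100.0 for i in range(0, 501)]
z_relaxed = min(grid, key=f)      # minimizer of the relaxed problem
z_rounded = round(z_relaxed)      # naive recovery of an integer answer

# Discrete problem: enumerate the integer candidates directly.
z_best = min(range(0, 6), key=f)

print("relaxed minimizer:", z_relaxed)
print("rounded integer:", z_rounded, "objective:", f(z_rounded))
print("best integer:", z_best, "objective:", f(z_best))
```

In this sketch the rounded point and the best integer differ, which is the situation a method that treats integer variables directly is meant to avoid.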

2.1. Description of Algorithm. Algorithm 8 shows a synchronous version
of EAGLS which demonstrates the algorithm’s core functionality. In practice, EAGLS is implemented
asynchronously in order to distribute computational loads evenly. Steps 3 through
7 are the classical implementation of a genetic algorithm. Steps 8 and 9 are the local
search portion of EAGLS. Points of synchronization occur at lines 1 and 5, where new points
of the GA are evaluated, as well as at Step 9, when instances of APPS are called. To alleviate
the potential load imbalance that may occur at these steps, a shared evaluation queue is used
so that the local search can run while waiting for the GA to complete its evaluations. A more
complete example of how EAGLS works can be seen in Figure 2.1.

2.2. Benefits of EAGLS. The real power behind the EAGLS hybrid is that the algorithm
allows NSGA-II and APPS to work together while focusing on what each algorithm
does best. In doing this, the GA is able to pick promising areas for global optima and stop
before its own function evaluations begin to grow. After points are picked by the GA, the local
search refines them quickly and sends the points back into the population to
be run through the algorithm again. In [7], EAGLS is tested on a compression spring model,
a simulation-based hydrology application, and a standard mixed integer test problem. These
problems and results can be found in [7] and its references. Though further testing and additions
are needed for EAGLS, its numerical results are promising. In addition, the solutions
found using EAGLS are comparable to those found using methods in each problem’s
literature.
Algorithm 8 Genetic Algorithm Guiding Local Search
Require: Population size: $n_p$
Require: Maximum number of generations: $n_g$
Require: Budget for local search: $n_k$
Require: Number of parallel local searches desired: $n_s$
1: Generate (evaluate in parallel) and rank initial population: $P_1 = p_1, \ldots, p_{n_p}$
2: for $k = 1, \ldots, n_g$ do
3:   $P_{k+1} = \text{select}(P_k)$
4:   $P_{k+1} = \text{mutate}(P_{k+1})$
5:   Evaluate in parallel new points in $P_{k+1}$
6:   $P_{k+1} = \text{merge}(P_k, P_{k+1})$
7:   $P'_{k+1} = \text{rank}(P_{k+1})$
8:   Choose first $n_s$ of $P'_{k+1}$ for local search
9:   Create $n_s$ instances of APPS for $1 \le i \le n_s$ subproblems:
       minimize over $x \in \mathbb{R}^{n_r}$: $f(x, \text{int}(p'_i))$ subject to $x_\ell \le x \le x_u$
10:   while number of evaluations $< n_k$ do
11:     Run APPS instance in parallel with parallel evaluations
12:   end while
13: end for
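
The synchronous loop of Algorithm 8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions. A simple elitist GA stands in for NSGA-II, a serial compass search stands in for APPS, all evaluations are serial, and every name here (`eagls_sync`, `pattern_search`, `demo_objective`, the parameter defaults) is hypothetical rather than taken from the EAGLS or ACRO code.

```python
import random

def pattern_search(f, z, x, lo, hi, budget, step=0.5, tol=1e-6):
    """Compass search over the reals x with the integers z held fixed (stand-in for APPS)."""
    best, evals = f(z, x), 1
    while evals < budget and step > tol:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                if evals >= budget:
                    break
                trial = list(x)
                trial[i] = min(hi, max(lo, trial[i] + d))
                val = f(z, trial)
                evals += 1
                if val < best:
                    best, x, improved = val, trial, True
        if not improved:
            step *= 0.5                      # contract the pattern
    return best, x

def eagls_sync(f, n_int, n_real, z_lo, z_hi, lo, hi,
               n_p=20, n_g=20, n_k=60, n_s=3, seed=0):
    rng = random.Random(seed)

    def random_point():
        return (tuple(rng.randint(z_lo, z_hi) for _ in range(n_int)),
                [rng.uniform(lo, hi) for _ in range(n_real)])

    def mutate(z, x):
        z = tuple(min(z_hi, max(z_lo, zi + rng.choice((-1, 0, 1)))) for zi in z)
        x = [min(hi, max(lo, xi + rng.gauss(0.0, 0.3))) for xi in x]
        return z, x

    rank = lambda pop: sorted(pop, key=lambda t: t[0])
    # Step 1: generate, evaluate, and rank the initial population.
    pop = rank([(f(z, x), z, x) for z, x in (random_point() for _ in range(n_p))])
    for _ in range(n_g):                                        # step 2
        parents = pop[: n_p // 2]                               # step 3: select
        children = [mutate(z, x) for _, z, x in parents]        # step 4: mutate
        evaluated = [(f(z, x), z, x) for z, x in children]      # step 5: evaluate
        pop = rank(pop + evaluated)[:n_p]                       # steps 6-7: merge, rank
        for j in range(n_s):                                    # steps 8-12: local search
            _, z, x = pop[j]
            val, x_new = pattern_search(f, z, x, lo, hi, n_k)
            pop[j] = (val, z, x_new)                            # refined point re-enters
        pop = rank(pop)
    return pop[0]

def demo_objective(z, x):
    """Hypothetical mixed-variable test function with optimum z = (3,), x = (1.5, 1.5)."""
    return (z[0] - 3) ** 2 + sum((xi - 1.5) ** 2 for xi in x)

best_val, best_z, best_x = eagls_sync(demo_objective, n_int=1, n_real=2,
                                      z_lo=0, z_hi=5, lo=-5.0, hi=5.0)
print(best_val, best_z, best_x)
```

The GA only has to place a point with the right integer value near the top of the ranking. The local search then does the fine adjustment of the real variables, which mirrors the division of labor described in Section 2.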


3. ACRO. A Common Repository for Optimizers (ACRO) is an open source optimization
package developed and maintained by Sandia National Laboratories. ACRO is included in the
DAKOTA optimization package under the name Coliny for distribution to the public. ACRO
combines previous optimization libraries developed by Sandia. These include
the common optimization library interface (COLIN) and the Sandia COLIN optimizer library
(SCOLIB), both originally developed by William Hart [6] and maintained by John Siirola.
These two libraries contain both first- and third-party optimizers to be used with the ACRO
interface. In addition to optimizers, the COLIN library contains packages used for managing
execution of the objective function, handling input and output, reading input from XML
files or AMPL methods, and other background tasks which are required for ACRO to
run. Users are able to communicate with the ACRO package using either XML or AMPL via
the Coliny interface. Either method simply passes an objective function, the optimizer to
be used, and whatever constraints and initial values are needed. With this input, Coliny
handles the running of ACRO automatically. Coliny also allows the user to specify multiple
optimizers in order to create a hybrid scheme, provided the two optimizers have compatible
domains. One of the most notable libraries built into ACRO is UTILIB, a library of C and
C++ utility methods similar to the Boost libraries [4]. UTILIB contains generic data types
which allow optimizers and objective functions that may not have perfectly compatible
data types to talk to each other, as the data goes through a generic typecasting. With a deep
understanding of the aforementioned libraries, it is possible for users to customize or develop
their own solvers. In this case, we implement EAGLS in the ACRO package.

4. Implementing EAGLS into ACRO. In order to implement EAGLS into ACRO it
was necessary to have a full understanding of some specific libraries built into ACRO. The
PointSet package is used by ACRO to receive and set points used by each individual
optimizer. Methods within PointSet are able to get the final point, which would most likely
be the best answer, or set the initial point to be used in an optimizer call. In either of these
cases, the points are stored in an appropriate container using ACRO’s generic typecasting
from the UTILIB library. The solver manager library, SolverMngr, allows optimizers to
be easily called, whether by the Coliny interface directly or from within another block of
code. In the case of EAGLS, the pattern search (PS) and generic evolutionary algorithm (EA)
in the SCOLIB library will be called from within a parallel framework using methods from
the aforementioned packages, without the worry of how each optimizer handles points itself.
These two optimizers were chosen due to their resemblance to APPS and NSGA-II,
which EAGLS was designed to use. Parallelism is done using the message passing interface
(MPI). MPI allows the EA to be set as a master node which can call a number of slave
nodes based on how many available processors are defined by the user.

Fig. 2.1. EAGLS execution

For actual implementation, only a small portion of the code shown in Algorithm 8 needs to be
considered. As mentioned previously, the optimizers used will be taken directly from ACRO,
so the extent of their implementation will be method calls and data passing. Input and output
are also handled directly by ACRO and require that the EAGLS package call the correct data
members which hold the data passed by the input file. In order to make the program asynchronous,
a queue must be present to handle data going to the direct search. For the
time being, a queuing system built into ACRO will be used until a more specialized one can be
developed. The cache used to hold points found by the pattern search until they are required
by the EA will be implemented similarly to the queue. To implement the parallelism,
most of the necessary code and algorithms for MPI are widely available and just need to be
looped over in order to produce the same results as in EAGLS. Over the summer a basic
framework was created, but it needs further development before any tests can be run
on the new implementation of EAGLS. The full implementation could not be completed because
several functionalities within ACRO which are necessary for EAGLS to run
are untested or have not been updated in recent releases. There are also several built-in features
within ACRO, such as a data queue, which could serve the needs of EAGLS but are not as
powerful as those of the original EAGLS implementation. A fully functional implementation is
due out in late fall.
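
The roles of the queue and the cache described above can be sketched with Python threads. This is a minimal illustration and not the ACRO queuing system or its API. The names `refine`, `local_search_worker`, and `demo` are hypothetical, and a tiny one-dimensional compass search stands in for the pattern search.

```python
import queue
import threading

def refine(f, x, step=0.5, tol=1e-4):
    """Tiny 1-D compass search standing in for the pattern search."""
    fx = f(x)
    while step > tol:
        for trial in (x + step, x - step):
            ft = f(trial)
            if ft < fx:
                x, fx = trial, ft
                break
        else:
            step *= 0.5            # neither direction improved: contract
    return x, fx

def local_search_worker(f, to_search, cache, lock):
    """Pull starting points from the queue; park refined points in the cache."""
    while True:
        start = to_search.get()
        if start is None:          # sentinel: the EA has no more points
            break
        result = refine(f, start)
        with lock:
            cache.append(result)   # held until the EA asks for it

def demo():
    f = lambda x: (x - 2.0) ** 2   # hypothetical objective
    to_search, cache, lock = queue.Queue(), [], threading.Lock()
    workers = [threading.Thread(target=local_search_worker,
                                args=(f, to_search, cache, lock))
               for _ in range(2)]
    for w in workers:
        w.start()
    for start in (-3.0, 0.5, 4.0, 7.5):   # points the EA selected for refinement
        to_search.put(start)
    for _ in workers:
        to_search.put(None)
    for w in workers:
        w.join()
    with lock:                     # the EA drains the cache when it is ready
        refined = list(cache)
        cache.clear()
    return refined

refined_points = demo()
print(refined_points)
```

In an asynchronous run the EA would keep generating and evaluating points while the workers run, and would drain the cache at the start of each generation instead of waiting for the workers to finish.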

5. Future Work. Currently, EAGLS is under further development to include more features
such as more appropriate stopping conditions and advanced constraint handling. Having
output  with  uncertainty  quantification,  so  a  user  can  be  aware  if  there  is  any  potential  error 
in  their  answer,  is  also  planned.  When  these  additions  are  made  to  the  main  EAGLS  code, 
it  will  be  necessary  to  implement  them  in  the  EAGLS  package  within  ACRO  as  well.  It  will 
also  be  necessary  to  fully  test  the  ACRO  implementation  of  EAGLS  against  problems  similar 
to  those  that  the  EAGLS  algorithm  was  tested  on.  Once  this  is  done,  a  functional  framework 
will  exist  which  will  allow  for  more  hybrid  schemes  to  be  easily  created  within  ACRO.  The 
way  the  code  is  currently  written  will  allow  this  simply  by  swapping  one  optimizer  function 
call  with  another. 
Computational Applications

Articles in this section discuss the use of computational techniques in physical simulations.
Presented here are several simulations of varying physics at both the atomic and
continuum level.

Theofanis et al. describe the electron force field wavepacket molecular dynamics method.
They show that this force field is able to reproduce shock pressure for the material polyethylene.
Ortega and Scovazzi consider a new remapping approach for finite element arbitrary
Lagrangian-Eulerian methods based on flux corrected transport. They present results that
show that the new scheme preserves monotonicity and high-order convergence where the solution
is smooth. Maskey et al. perform a molecular dynamics study of a particular conjugated
nanoparticle polymer. Using these numerical results, the authors show that the particle is in
a stable collapsed state in a poor solvent but unravels in a good solvent. Anderson et al. use
density functional techniques to investigate defects in amorphous silicon dioxide. Takato et
al. use molecular dynamics to study the collision of two nanoparticles. Their study finds distinct
regimes where the particles undergo elastic and plastic deformations. Costolanski and
Salinger outline a method to model the performance of a resonant tunneling diode. Details
about the Trilinos implementation are presented and discussed.
AN ELECTRON FORCE FIELD STUDY OF SHOCKED POLYETHYLENE

PATRICK L. THEOFANIS, THOMAS R. MATTSSON, AND AIDAN P. THOMPSON

Abstract. Electron force field (eFF) wavepacket molecular dynamics simulations of the principal shock Hugoniot
are presented for crystalline polyethylene (PE) models along with a description of the eFF method. The eFF
results are in reasonably good agreement with previous DFT results and with experimental data, which are available
up to 80 GPa. We predict shock Hugoniots for PE up to 450 GPa. In addition we provide an analysis of the phase
transformations which occur due to heating. Our analysis includes ionization fraction and particle temperatures during
isotropic compression. We find that above compression to 2.4 g/cm$^3$ the PE structure begins to break down and
electrons are ionized. Despite using a simple spherical Gaussian wavepacket basis, eFF is able to reproduce shock
pressures for this relatively complex material, and this demonstrates the potential for eFF as a method for studying
matter in extreme conditions.

1.  Introduction:  Studying  Materials  Under  Extreme  Conditions  With  an  Electron 
Force  Field.  The  next  generation  of  materials  sourced  for  energy  production,  high-velocity 
transportation,  military  and  medical  devices,  and  computer  hardware  will  need  to  reliably 
operate  under  extreme  conditions.  We  define  extreme  conditions  as  involving  temperatures 
in  excess  of  1000  Kelvin,  static  or  dynamic  pressures  greater  than  tens  of  mega-Pascals, 
strains and strain rates greater than 1 km/s, radiative fluxes greater than 100 dpa, high intensity
electromagnetic fields, and corrosive or erosive conditions whether these conditions be
encountered  in  combination  or  separately. 

Properties of hydrocarbons under pressure are of significant interest for basic science as
well as for practical applications. At Sandia National Laboratories, hydrocarbon foams are
used in dynamical materials research, within the Inertial Confinement Fusion (ICF) program,
and as a source of high temperature x-rays in radiation research. There is also significant
interest in the properties of hydrocarbons under pressure stemming from the need to understand
the deep earth. For example, are there abiotic pathways for formation of methane and
longer hydrocarbons under pressure, possibly in combination with catalytic transition metals
present in the earth’s mantle? There are also astrophysical applications: the atmospheres of
the ice giants Uranus and Neptune contain significant amounts of methane, water, and ammonia.
Understanding the properties of hydrocarbons under pressure is crucial in modeling the
structure of giant planets as well as exoplanets.

The interest in material responses to extreme conditions has prompted shock studies on
a wide variety of elements [1], but experimental and theoretical progress has been slower
for more complicated materials like polymers, semiconductors, and superconductors. Mattsson
and collaborators provided first-principles and molecular dynamics (MD) simulations
of shocked polyethylene (PE) to fill this research void [17]. PE is an interesting material
because of the wealth of experimental information about its thermodynamic, elastic, and
shocked properties [9] [26] [18]. Typical experimental data provides the equations of state for
materials through the Rankine-Hugoniot curves. This data is invaluable, but other important
measurements such as the temperature response to shock, ionization cross sections and
rates, and chemical compositions behind the shock wave are difficult or impossible to measure.
Without such information it is impossible to determine the types and rates of chemical
reactions that occur behind the shock wave. Quantum and classical molecular dynamics simulations
can provide temperatures corresponding to the Rankine-Hugoniot equations of state
but they cannot provide details regarding photochemical and photophysical processes, and
most classical potentials cannot provide information about transient chemical species. The
electron force field (eFF) method, a semi-classical molecular dynamics technique, is capable
of providing positions, forces, particle velocities, and energies for both electrons and nuclei in
addition to the standard thermodynamic quantities [23]. We can simulate the experimental
observables mentioned above through eFF trajectories. The details of the eFF method are
provided in the next section of this manuscript.

In this paper, we describe the eFF method and its application to a hydrostatic
shock study of PE. From this study we hope to determine the PE Rankine-Hugoniot curve and
compare it to experiments and other theoretical methods. Mattsson and collaborators demonstrated
that DFT with the AM05 functional is able to model polyethylene shock physics despite
lacking explicit van der Waals terms. Like most flavors of DFT, eFF has no explicit
energy expression for van der Waals or London dispersion interactions. In fact, all chemical
phenomena like bond lengths and strengths, steric effects, charge distributions, conformational
preferences, and electron shell filling are emergent properties in eFF. This study will
be an important test of eFF’s ability to describe non-covalent, non-metallic systems. We have
calculated material properties of the hydrostatically shocked PE and we have characterized
the chemical composition and physical phenomena resulting from the shock. Unlike most
other theoretical methods, eFF provides ionization yields and a reactive dynamics picture of
hydrostatic compression.

2. The Electron Force Field. eFF was developed with modeling matter under extreme
conditions in mind. Density functional theory and other ab initio quantum mechanical techniques
provide excellent descriptions of matter at low temperature and pressure and classical
plasma theories provide good descriptions of matter at high temperatures. There exists a
“computational no-man’s land” between these thermodynamic regimes that remains a challenge
to theorists. eFF was developed to bridge this gap and provide good descriptions of
matter near its ground state and in highly excited states.

2.1. The eFF Energy Expression. eFF overcomes the difficulties of modeling potentially
non-adiabatic systems by evaluating the energy of the system as a function of the nuclear
coordinates and electron coordinates with a small set of universal electron parameters.
This ensures that energy may be partitioned separately into nuclear and electronic degrees
of freedom; thus electrons may hop between states without concomitant nuclear motion. We
choose to describe nuclei as classical particles and we describe the electrons with a wavefunction
of floating spherical Gaussian orbitals similar to the method of Frost [8]. We define
our wavefunction as the Hartree product of floating spherical Gaussian wavepackets:

$$\Psi(\mathbf{r}) \propto \prod_j \exp\left[-\left(\frac{1}{s_j^2} - \frac{2 p_{s_j}}{s_j}\frac{i}{\hbar}\right)(\mathbf{r} - \mathbf{x}_j)^2\right] \cdot \exp\left[\frac{i}{\hbar}\,\mathbf{p}_{x_j} \cdot \mathbf{r}\right] \tag{2.1}$$

with positions $x$, translational momenta $p_x$, radial sizes $s$, and radial momenta $p_s$. Rather
than use a fully antisymmetrized wavefunction (for which an energy evaluation would require
$O(N^4)$ operations), we choose instead a Hartree product wavefunction, which requires only
$O(N^2)$ operations to compute the energy. Using a Hartree product wavefunction
violates the antisymmetry principle for fermions, which requires that interchanging any two
fermions should cause the sign of the wavefunction to change. In order to satisfy the Pauli
principle, we must account for the difference in energy between a fully antisymmetrized wavefunction
and a product wavefunction like our Hartree product. To do this we include an
explicit Pauli energy term which will be described shortly.

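The full potential energy expression is:

$$E = E_{ke} + E_{nuc\text{-}nuc} + E_{nuc\text{-}elec} + E_{elec\text{-}elec} + E_{Pauli} \tag{2.2}$$

We define the component energies as follows:

$$E_{ke} = \frac{\hbar^2}{m_e} \sum_i \frac{3}{2 s_i^2} \tag{2.3}$$

$$E_{nuc\text{-}nuc} = \frac{1}{4\pi\varepsilon_0} \sum_{i<j} \frac{Z_i Z_j}{R_{ij}} \tag{2.4}$$

$$E_{nuc\text{-}elec} = -\frac{1}{4\pi\varepsilon_0} \sum_{i,j} \frac{Z_i}{R_{ij}} \operatorname{erf}\left(\frac{\sqrt{2}\, R_{ij}}{s_j}\right) \tag{2.5}$$

$$E_{elec\text{-}elec} = \frac{1}{4\pi\varepsilon_0} \sum_{i<j} \frac{1}{x_{ij}} \operatorname{erf}\left(\frac{\sqrt{2}\, x_{ij}}{\sqrt{s_i^2 + s_j^2}}\right) \tag{2.6}$$

$$E_{Pauli} = \sum_{\sigma_i = \sigma_j} E(\uparrow\uparrow)_{ij} + \sum_{\sigma_i \ne \sigma_j} E(\uparrow\downarrow)_{ij} \tag{2.7}$$
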

where (2.7) comprises the Pauli potential for same spin and opposite spin electrons, respectively.
(2.3) describes the “quantum” electronic kinetic energy, which should not be confused
with the classical translational kinetic energy of the electron. The error functions in (2.5)
and (2.6) arise from the fact that the electron charges are “smeared” over the volume of the
Gaussian sphere. Recall that an error function is defined as the integral over a Gaussian and
its argument is the upper limit of the integral. This formulation of the Coulomb interactions
ensures that the finite sized spherical Gaussians act like point charges at large distances from
the other interacting particle. The same spin Pauli energy function is defined as:

$$E(\uparrow\uparrow)_{ij} = \left(\frac{S_{ij}^2}{1 - S_{ij}^2} + (1 - \rho)\frac{S_{ij}^2}{1 + S_{ij}^2}\right)\Delta T_{ij} \tag{2.8}$$

and the opposite spin Pauli energy is

$$E(\uparrow\downarrow)_{ij} = \left(\frac{-\rho\, S_{ij}^2}{1 + S_{ij}^2}\right)\Delta T_{ij} \tag{2.9}$$

where $\Delta T$ is the kinetic energy change upon antisymmetrization and $S$ is the overlap of the
wavepackets. We can further define these two terms:

$$\Delta T_{ij} = \frac{\hbar^2}{m_e}\left[\frac{3}{2}\left(\frac{1}{\bar{s}_i^2} + \frac{1}{\bar{s}_j^2}\right) - \frac{2\left(3(\bar{s}_i^2 + \bar{s}_j^2) - 2\bar{x}_{ij}^2\right)}{(\bar{s}_i^2 + \bar{s}_j^2)^2}\right] \tag{2.10}$$

$$S_{ij} = \left(\frac{2}{\bar{s}_i/\bar{s}_j + \bar{s}_j/\bar{s}_i}\right)^{3/2} \exp\left(-\frac{\bar{x}_{ij}^2}{\bar{s}_i^2 + \bar{s}_j^2}\right) \tag{2.11}$$


The last two equations contain the only empirical parameterizations in eFF. We define $\rho = -0.2$,
$\bar{x}_{ij} = x_{ij} \cdot 1.125$, and $\bar{s}_i = s_i \cdot 0.9$. These parameters were fit using a small set of
hydrocarbons and light metal hydrides. $\rho$ can be thought of as an orthogonalization parameter
while $\bar{x}$ and $\bar{s}$ are distance and size scaling parameters, respectively. The Pauli energy
functions in (2.10) and (2.11) are derived by taking the kinetic energy differences of orthogonalized
and non-orthogonalized wavefunctions. The Pauli functions are derived from
$E(\uparrow\uparrow) = E_u - (1 - \rho)E_g$ and $E(\uparrow\downarrow) = -\rho E_g$. The full derivation can be found in [23]. eFF
uses the difference between Slater and Hartree wavefunctions for the “ungerade” energy expression
and the difference between a general valence bond and a Hartree wavefunction to
calculate the “gerade” energy expression. The physical interpretation of this effect is more
easily understood in terms of orthogonal orbitals. When two same-spin electrons approach
one another, their wavefunctions increase in slope to decrease their overlap (they compress in
width). This increase in slope raises their kinetic energy. Wilson
and Goddard interpret this change in energy as the Pauli repulsion energy [28]. (2.7) recovers
this energy and ensures that eFF electrons satisfy the Pauli exclusion principle.

The beauty of eFF is in its simplicity. With only three empirical parameters it can reproduce
a variety of physical quantities like bond lengths, angles, ionization potentials, and bulk
properties. The simple nature of the energy and gradient expressions makes eFF computationally
far cheaper than conventional quantum mechanics calculations [23] [24] [25] [12].
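
The statement in this section that the smeared Gaussian charges behave like point charges at large separation can be checked numerically. The sketch below assumes the electron-electron pair term has the form $\operatorname{erf}(\sqrt{2}\,x_{ij}/\sqrt{s_i^2 + s_j^2})/x_{ij}$ and works in atomic units, so the prefactor $1/(4\pi\varepsilon_0)$ is 1. The function names are illustrative and are not part of any eFF code.

```python
import math

def elec_elec_coulomb(r, s_i, s_j):
    """Repulsion between two Gaussian electrons of sizes s_i, s_j at separation r (atomic units)."""
    return math.erf(math.sqrt(2.0) * r / math.sqrt(s_i ** 2 + s_j ** 2)) / r

def point_coulomb(r):
    """Repulsion between two unit point charges at separation r (atomic units)."""
    return 1.0 / r

# Compare the smeared and point-charge interactions over a range of separations.
for r in (0.1, 1.0, 5.0, 20.0):
    print(r, elec_elec_coulomb(r, 1.0, 1.0), point_coulomb(r))
```

At short range the smeared interaction stays finite while the point-charge value diverges. At separations much larger than the electron sizes the error function saturates at 1 and the two agree.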

2.2. The eFF Equations of Motion and Dynamics. In 1975 Heller demonstrated wavepacket
molecular dynamics (WPMD) as a method for simulating systems in a semi-classical
manner [10]. Rather than making a WKB approximation, wherein one assumes that $\hbar$ is very
small [27] [15], he approximated that the wavepacket exists in a local harmonic potential. By
substituting a wavefunction of the type in (2.1) into the time-dependent Schrödinger equation
with a harmonic potential Heller derived the Hamilton equations of motion:

$$p_x = m\dot{x}, \qquad \dot{p}_x = -\nabla V \tag{2.12}$$

These equations are consistent with Ehrenfest’s theorem which states that the average position
of a wavepacket follows a classical trajectory [7]. Following the same procedure for the first
exponential in (2.1), and making the assumption that no external potential exists, we can
derive the equations of motion for the radial degree of freedom:

$$p_s = \frac{3 m_e}{4}\dot{s}, \qquad \dot{p}_s = -\frac{\partial E}{\partial s} \tag{2.13}$$



In eFF, $m_e$ appears in three places: (2.3), (2.10), and (2.13). In the former two equations,
which correspond to the electron's potential energy, $m_e$ is the true electron mass. In the
latter equation, it is a user-definable parameter. Changing $m_e$ in the potential energy terms
would affect the sizes and bond lengths of electrons in GS atoms, so it is fixed there.
Allowing the "dynamical" electron mass in the equations of motion to be adjusted serves a
practical purpose: the user can increase the electron's mass so that it is commensurate with
the nuclear masses, which in turn permits a larger time step in dynamics simulations. It also
improves coupling in deterministic thermostats. This is not unprecedented; the Landau theory
of Fermi liquids uses heavy quasiparticles that obey Fermi statistics, and the electron mass is
adjustable in semi-classical theories of electron transport in semiconductors. It is prudent to
attempt simulations using both the real electron mass and a larger, user-defined electron mass.
The factor of 3/4 in (2.13) arises directly from the substitution of a Gaussian wave packet into
the time-dependent Schrödinger equation. These equations are exact for harmonic potentials, and
they have been shown to perform well for simple anharmonic potentials such as the double-well
potential [23]. We can use the energies and forces from the electron force field in conjunction
with this WPMD scheme as a fully functioning molecular dynamics method.
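
As a minimal sketch of how (2.12) and (2.13) can be advanced in time, the following velocity-Verlet step treats the wavepacket center x and radius s as two coupled degrees of freedom, with the radial coordinate carrying an effective mass of (3/4) m. The function names, the force callbacks, and the toy harmonic energy are hypothetical illustrations; they are not taken from eFF or LAMMPS, and the integrator is an assumption rather than the one used in this work.

```python
def wpmd_step(x, s, px, ps, force_x, force_s, m_dyn, dt):
    """One velocity-Verlet step for a wavepacket center x and radius s.

    Translational motion uses p_x = m * dx/dt; radial motion uses
    p_s = (3/4) * m * ds/dt, so the radial coordinate has mass 0.75*m.
    m_dyn is the adjustable "dynamical" electron mass.
    """
    m_s = 0.75 * m_dyn
    # half kick with the current forces
    px_half = px + 0.5 * dt * force_x(x, s)
    ps_half = ps + 0.5 * dt * force_s(x, s)
    # drift
    x_new = x + dt * px_half / m_dyn
    s_new = s + dt * ps_half / m_s
    # second half kick with the updated forces
    px_new = px_half + 0.5 * dt * force_x(x_new, s_new)
    ps_new = ps_half + 0.5 * dt * force_s(x_new, s_new)
    return x_new, s_new, px_new, ps_new


def toy_forces(k=1.0, k_s=1.0, s0=1.0):
    """Hypothetical harmonic test energy E = k x^2 / 2 + k_s (s - s0)^2 / 2."""
    def force_x(x, s):
        return -k * x

    def force_s(x, s):
        return -k_s * (s - s0)

    return force_x, force_s


def total_energy(x, s, px, ps, m_dyn, k=1.0, k_s=1.0, s0=1.0):
    """Kinetic plus toy potential energy; the radial kinetic term carries 3/4 m."""
    m_s = 0.75 * m_dyn
    kinetic = px**2 / (2.0 * m_dyn) + ps**2 / (2.0 * m_s)
    potential = 0.5 * k * x**2 + 0.5 * k_s * (s - s0) ** 2
    return kinetic + potential
```

Raising `m_dyn` lowers the characteristic frequencies of both coordinates, which is why a heavier dynamical mass tolerates a larger `dt` in this toy setting.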
3. Construction of Simulation Cells and Computational Methods.

3.1. Polyethylene Models. A crystalline PE model was created by taking experimental
PE crystal lattice parameters and making the chains finite. Cell parameters were taken from
[3]. The final cell contained 12 × C$_{12}$H$_{26}$ molecules. In real samples of crystalline PE the
chains are finite in length and the PE is crystalline only in small domains, with lamellae ranging
from 70 to 300 Å in thickness and extending several microns laterally [6]. The volumes of
the boxes were adjusted so that the density of both systems was 0.98 g/cm$^3$.

3.2. Computational Methods. A version of eFF incorporated into the popular and powerful
molecular dynamics simulator LAMMPS was used for all simulations [12] [20].

To prepare the cells for shock simulations, a conjugate gradient scheme was used to minimize
the potential energy of each cell. The search terminated when either the line-search parameter α
or the mean force fell to zero within a tolerance of 1.0 × 10$^{-6}$, or when the energy difference
between successive steps was smaller than 1.0 × 10$^{-6}$. Preliminary microcanonical ensemble
simulations revealed that a timestep of 0.001 fs was necessary to prevent energy drift over
simulation periods of 2 ps. Two canonical ensembles were prepared for the PE model: the first was
prepared with NVT using a Nosé-Hoover thermostat, and the second was prepared using a Langevin
thermostat. System temperature was obtained from the kinetic energy of the atomic nuclei. The
samples were allowed to equilibrate for an additional 2 ps. From this "ground state" sample,
samples with densities of 1.3-3.0 g/cm$^3$ were generated by isothermally and isotropically
compressing the ground state simulation cell over a period of 100 fs using the LAMMPS "fix
deform" command in conjunction with either the Nosé-Hoover or Langevin thermostat. A total
of 17 density points for each ensemble were generated for the 12 × C$_{12}$H$_{26}$ cell in increments
of 0.1 g/cm$^3$. These density points were also allowed to equilibrate for 2 ps at 300 K.
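
The isotropic compression described above amounts to rescaling every box edge by the same factor at fixed mass. A minimal sketch of that arithmetic follows; the function name is hypothetical and the helper is independent of LAMMPS.

```python
def isotropic_scale(rho_initial, rho_target):
    """Linear scale factor for each box edge that takes the density from
    rho_initial to rho_target at constant mass (volume scales as factor**3)."""
    if rho_initial <= 0.0 or rho_target <= 0.0:
        raise ValueError("densities must be positive")
    return (rho_initial / rho_target) ** (1.0 / 3.0)
```

For example, `isotropic_scale(0.98, 1.3)` gives the edge scaling needed to take the 0.98 g/cm$^3$ ground-state cell to the first compressed density point.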

3.3.  The  Principal  Hugoniot.  A  Hugoniot  curve  is  the  locus  of  thermodynamic  states 
that  can  be  reached  by  shock  compression  of  an  initial  state.  These  states  satisfy  the  Rankine- 
Hugoniot  energy  condition  [21]  [11]: 


$$U - U_0 = \frac{1}{2}(P + P_0)(V_0 - V) \qquad (3.1)$$


where U is the internal energy, P is the pressure of the system, and V is the cell volume.
The kinetic energy of the electrons is defined by (4.1). It is assumed that each point on this
Hugoniot curve corresponds to a state of thermodynamic equilibrium wherein the stress state
is hydrostatic. For solids, this latter condition is only valid when the yield stress is much
lower than the mean stress [5]. When the initial state variables $P_0$, $V_0$, and $U_0$ are those of
the uncompressed sample at room temperature, the Hugoniot curve is called the principal
Hugoniot.

We  generated  states  on  the  principal  Hugoniot  using  the  following  iterative  procedure. 
First  the  volume  of  the  system  is  specified,  representing  a  particular  degree  of  compression. 
Then  the  temperature  of  the  system  is  quickly  increased  by  changing  the  set-point  of  the 
thermostat.  20  fs  of  dynamics  are  run  after  the  thermostat  jump,  during  which  averages  of 
the  energy,  temperature  and  pressure  of  the  new  state  are  obtained.  These  values  are  used  to 
evaluate  the  residual  energy  Eres,  given  by 


$$E_{res} = (U - U_0) - \frac{1}{2}(P + P_0)(V_0 - V). \qquad (3.2)$$


When $|E_{res}| / |E_{ke}| < 0.05$ the Hugoniot condition is considered satisfied. If this inequality
is not satisfied, another iteration is performed. The new thermostat setpoint is calculated from:
[Figure: Polyethylene Rankine-Hugoniot plot]

Fig. 4.1. Hugoniot data for the 12 × C$_{12}$H$_{26}$ system. Two eFF data series are plotted. The dark green closed
circles are the steady-state canonical ensembles computed using a Nosé-Hoover thermostat. The light green closed
circles are the steady-state canonical ensembles computed using a Langevin thermostat. Experimental data from the
LASL handbook [1] and Nellis [18] are shown as open red squares and open orange triangles, respectively. Data
taken from [17] for the classical MD potentials ReaxFF, OPLS, and AIREBO are included for comparison.


Once  this  iterative  calculation  has  converged,  the  final  state  is  sampled  for  a  further  5 
ps.  This  calculation  ensures  that  the  Hugoniot  condition  is  actually  met,  and  the  standard 
deviation  in  pressure  is  used  to  provide  error  bars  for  the  principal  Hugoniot. 
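
The residual test at the heart of this iteration can be sketched as follows. This is a minimal illustration, assuming that all energies are expressed in the same units as pressure times volume and that the convergence criterion is the ratio of the magnitudes of the residual energy and the electron kinetic energy; the function names are hypothetical, and the thermostat set-point update itself is not reproduced here.

```python
def hugoniot_residual(U, P, V, U0, P0, V0):
    """Residual of the Rankine-Hugoniot energy condition, as in (3.2)."""
    return (U - U0) - 0.5 * (P + P0) * (V0 - V)


def hugoniot_converged(U, P, V, U0, P0, V0, E_ke, tol=0.05):
    """True when the residual is small relative to the reference energy E_ke.

    The 5 percent tolerance follows the text; normalizing by |E_ke| is an
    assumption about the intended criterion.
    """
    return abs(hugoniot_residual(U, P, V, U0, P0, V0)) / abs(E_ke) < tol
```

In an iterative driver, the averaged energy, pressure, and volume from each short run after a thermostat jump would be passed to `hugoniot_converged` until it returns `True`.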

4. Results and Discussion. eFF is a reasonably fast and highly scalable MD technique.
eFF is slower than Lennard-Jones (LJ) potentials by a factor of 1500. For reference, ReaxFF
is slower than LJ by a factor of 340. Speeds of 5 × 10$^{-6}$ seconds/timestep/particle were
achieved for the 1632-particle 12 × C$_{12}$H$_{26}$ system during an NVE/Langevin equilibration run
on Los Alamos National Laboratory's "Lobo" high performance computer cluster using 32
processors. More eFF performance information can be found in [12].

4.1. The Polyethylene Principal Hugoniot. Figure 4.1 is the principal Hugoniot projected
onto the pressure-density plane. For intermediate compressions the Nosé-Hoover thermostat
matched the experimental and DFT Hugoniot points quite closely. The Langevin
ensemble did not do as well, but still performed reasonably well, within 50% of the experimental
value. At greater compressions, both eFF methods predicted the shock pressure high
relative to DFT. The results show that eFF is systematically "too stiff" relative to the experimental
and DFT/AM05 [2] data. However, eFF outperforms several classical MD potentials
such as AIREBO [22], OPLS [13], and exp-6 [4]; the data for these can be found in [17].
These potentials reached their pressure asymptote below 1.7 g/cm$^3$. These results demonstrate
the difficulty in modeling the behavior of complex materials under shock compression.

At high compression, interesting features appear in the principal Hugoniot. In the AM05
data series a shoulder feature appears at 2.3 g/cm$^3$. In the eFF Nosé-Hoover data a series
of shoulders appears between 2.4 and 2.8 g/cm$^3$. The first of these appears as a clear plateau
just past 2.4 g/cm$^3$. The next two features are more subtle. Two shoulders appear in the eFF
Langevin data series. The first appears at 2.0 g/cm$^3$ and the second can be found at 2.45
g/cm$^3$. These data features correspond to tangible transitions in the molecular structure.
Mattsson found that the AM05 shoulder corresponded to PE backbone bond breaking [17].
The possible causes for the eFF data features are discussed below.

4.2. Structural Decomposition. The heat caused by hydrostatic shock induces molecular
dissociation and ionization. This causes the polymers to transition from liquid to atomic
liquid and finally, at extremely high shock pressures, to dense plasma. Along the way the
thermophysical character of the ordinarily insulating polymer changes drastically. These changes
are manifested in the specific heat capacity, conductivity, and emissivity of the material. The
material's response to high shocks may have important implications for its use in certain
environments.


[Figure panels: (a) C-C RDF; (b) H-H RDF; (c) C-H RDF (C-H pair distribution function)]

Fig. 4.2. Radial distribution functions for various pairs of nuclei. Each plot contains RDF data for densities of
0.95, 1.5, 2.0, 2.5, and 3.0 g/cm$^3$. The H-H RDF (b) also contains an RDF for 2.8 g/cm$^3$ because compression to
this density produces molecular hydrogen.


An analysis of the radial pair distribution functions for different degrees of compression
demonstrates that significant structural decomposition occurs upon shock. Figure 4.2(a)
shows that carbon bonds are compressed as the sample is compressed. The carbon atoms remain
discretely bound to their neighbors because a minimum with a value of zero is found just beyond
the standard C-C bond length of 1.5 Å for each level of compression. Contrast this with
the C-H pair distribution function in Fig. 4.2(c), where the data indicate a phase change to
an atomic fluid of hydrogens. The 3.0 g/cm$^3$ series resembles a Lennard-Jones fluid. At this
level of shock compression the hydrogen atoms are partially dissociated from the PE chains. This
behavior is also seen in the H-C pair radial distribution function in Fig. 4.2(d). From these data
we can conclude that on compression from 2.5 to 3.0 g/cm$^3$ the structure is shocked
strongly enough to cause a phase transition to a state where the carbon backbones remain
mostly intact but are solvated by loosely associated hydrogen atoms. For compression to 2.8
g/cm$^3$, the small peak in the H-H data in Fig. 4.2(b) near 0.7 Å shows that molecular hydrogen
is formed. At this density the system has a shock temperature of 2992 K. Mattsson and
collaborators also found hydrogen formation when their shocked PE reached 2800-3100 K
[16]. In their simulations this temperature range corresponded to densities of 2.2-2.3 g/cm$^3$.
For temperatures higher than 3100 K the molecular hydrogen becomes too energetic to stay
bound, and at lower temperatures the hydrogen atoms do not have enough energy to dissociate from
their polyethylene backbone. The eFF results are consistent with MD and DFT results for
equivalent temperatures, but the structural changes that result from heating occur at
higher compressions in eFF.

Fig. 4.3. The percentage of ionized electrons at different stages of compression for the two eFF ensembles.

One of eFF's greatest assets is its ability to separate electron degrees of freedom, energies,
positions, momenta, and forces from those of the nuclei. This gives us an unrivaled
ability in the world of molecular dynamics to measure electronic physical quantities. In our
investigation of PE we have used this to measure the ion fraction at each stage of shock. To
do this we measure the total energy (potential plus kinetic) of each electron at each
timepoint in our simulations. In eFF an electron's kinetic energy is given by:


$$E_{ke} = \frac{1}{2} m_e \dot{x}^2 + \frac{1}{2}\,\frac{3}{4}\, m_e \dot{s}^2 \qquad (4.1)$$


The  electron’s  potential  energy  is  measured  as  half  the  sum  of  pairwise  interactions  plus  the 
electronic  kinetic  energy  in  (2.3).  To  measure  the  correct  potential  energy  of  an  electron 
we  must  multiply  the  potential  energy  of  the  electron  in  question  by  two  and  subtract  the 
quantum  electronic  kinetic  energy  (which  was  doubled  by  multiplying  the  potential  energy 
by  two).  To  compute  the  total  energy  of  the  electron  we  add  the  electron  classical  kinetic 
energy  according  to  (4.1).  We  define  an  electron  as  being  ionized  if  its  total  energy  is  greater 
than  zero.  The  data  in  Fig.  4.3  was  computed  in  this  manner.  The  number  of  ionized  electrons 
was  averaged  over  each  steady  state  calculation. 
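
The bookkeeping described above can be sketched in a few lines. This is an illustrative reading of the procedure, not the analysis script used for Fig. 4.3: the array names are hypothetical, `pe_measured` stands for the per-electron potential energy as reported (half the pairwise sum plus the quantum kinetic term), and all quantities are assumed to share one unit system.

```python
import numpy as np


def classical_ke(m_e, v, s_dot):
    """Per-electron classical kinetic energy following (4.1).

    v has shape (N, 3) (translational velocities); s_dot has shape (N,)
    (radial velocities).
    """
    return 0.5 * m_e * np.sum(v**2, axis=1) + 0.5 * 0.75 * m_e * s_dot**2


def ionized_fraction(pe_measured, ke_quantum, m_e, v, s_dot):
    """Fraction of electrons whose total energy is positive.

    The measured potential energy is doubled to undo the half-counting of
    pair terms, and one copy of the quantum kinetic energy is subtracted
    because doubling counted it twice.
    """
    pe_full = 2.0 * pe_measured - ke_quantum
    e_total = pe_full + classical_ke(m_e, v, s_dot)
    return np.mean(e_total > 0.0)
```

Averaging the returned fraction over the frames of a steady-state run gives a single ionization value per density point.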

The results of the ionization calculations show that at compressions above a density of
2.4 g/cm$^3$ the electrons in PE begin ionizing. This implies that PE is conductive above this
density. Above this threshold electron ionization draws energy from the system, and this
affects the pressure and temperature of the Hugoniot. The temperature suppression caused by
ionization can be seen in Fig. 4.1. The onset of ionization is likely a cause of the plateau and
shoulder features in the principal Hugoniot.

Fig. 4.4. Pressure-temperature Hugoniot relationships. Two temperatures are plotted for the eFF Langevin
ensemble: the closed light green circles correspond to the eFF temperature computed according to (4.2), and the
open light green squares correspond to the temperatures of the nuclei only. Data for DFT/AM05, OPLS, AIREBO,
and ReaxFF are also provided from [17].

4.3. Shock Temperatures. Before the results are presented, the definition
of temperature in eFF ought to be discussed. Equation (4.2) is the default temperature definition in eFF.


$$T = \frac{1}{3 k_B N_{nuc}} \langle K \rangle \qquad (4.2)$$


In this equation $\langle K \rangle$ is the average kinetic energy of all the particles in the system, and $N_{nuc}$
is the number of nuclear degrees of freedom. This definition sets the kinetic contribution to
the specific heat capacity to $\frac{3}{2} k_B T$, where only the nuclear degrees of freedom contribute. This
approximation should be valid below the Fermi temperature, where electrons are not excited
and thus have negligible kinetic energy. However, in practical eFF simulations electrons gain
kinetic energy, and thus a temperature, as soon as the temperature of the system rises above 0
K. As a result, the temperature computed by (4.2) will be too high.

Temperature is a vital measure for any system because it dictates the rates and types of
chemical reactions that are occurring within that system. Mattsson and collaborators evaluated
the pressure-temperature slice of the PE Hugoniot for several potentials. The classical
MD potentials were in near perfect agreement with DFT. The temperatures presented in Fig.
4.4 for the NVE/Langevin data series are calculated from the mean kinetic energies of the
nuclei and from the eFF temperature definition. The data show that, relative to DFT, eFF
drastically underestimates the temperature along the Hugoniot. To understand the cause of
this we measured temperature in a more fundamental way. We conducted an analysis of the
temperatures of each degree of freedom calculated according to equipartition, for example:

$$\frac{1}{2} k_B T_x = \frac{1}{2} m \langle \dot{x}^2 \rangle \qquad (4.3)$$
Fig. 4.5. The specific heat capacity of crystalline polyethylene. The green circles show data computed from
constant volume heat capacity simulations computed using NVE with a Langevin thermostat. The data were converted
to constant pressure heat capacity using the volume and compressibility parameters from [14]. The open black
squares show experimental heat capacity data for polyethylene crystal taken from [9].


for  nuclear  and  electron  translational  degrees  of  freedom,  and 


for the electron radial degree of freedom. This test revealed that, when using the Langevin
thermostat, the temperatures satisfied equipartition and matched the user-specified temperature
of the thermostat. Though not shown, we found that Nosé-Hoover did a poor job of
partitioning energy, even after 5 ps of simulation. It typically provided nuclei with more energy
than electrons, likely because of poor coupling between the electrons and the heat
bath. This can be avoided by using Nosé-Hoover chains, as one would for a system that
includes both heavy metals and hydrogen.
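
A per-degree-of-freedom temperature check of this kind can be sketched as follows. The sketch assumes the standard equipartition statement that each quadratic degree of freedom holds $\frac{1}{2} k_B T$ on average, and that the radial coordinate carries the effective mass $\frac{3}{4} m_e$ implied by (4.1); the function names and the choice of reduced units ($k_B = 1$ by default) are hypothetical.

```python
import numpy as np


def translational_temperature(mass, v, k_B=1.0):
    """Temperature from (1/2) k_B T = <(1/2) m v_a^2>, averaged over all
    particles and Cartesian components; v has shape (N, 3)."""
    return mass * np.mean(v**2) / k_B


def radial_temperature(m_e, s_dot, k_B=1.0):
    """Temperature of the electron radial degree of freedom, using the
    effective radial mass (3/4) m_e; s_dot has shape (N,)."""
    return 0.75 * m_e * np.mean(s_dot**2) / k_B
```

If equipartition holds, the two functions return the same value (within sampling noise) for nuclei, electron translation, and electron radial motion; a persistent gap between them signals poor thermostat coupling.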

Figure  4.5,  a  plot  of  the  specific  heat  capacity  between  20  K  and  400  K,  shows  that  in 
general  eFF  overestimates  the  specific  heat  capacity  of  PE.  This  plot  demonstrates  that  the 
PE  model  is  resistant  to  temperature  change  when  energy  is  added  to  the  system.  For  systems 
like  polyethylene  which  contain  far  more  electrons  than  ions,  eFF  will  exhibit  this  behavior. 
As  discussed  before,  this  is  because  electrons  are  essentially  treated  as  classical  particles,  and 
they  obtain  kinetic  energy  as  soon  as  the  system  has  a  temperature. 

There is one other possibility for the temperature discrepancy in Fig. 4.4. In classical
molecular dynamics simulations energy can be distributed into translational, rotational,
vibrational or bonding degrees of freedom for nuclei only. In eFF the electron degrees of
freedom offer additional channels for energy partitioning. Exciting electrons (i.e. ionizing
them) uses energy that might otherwise go to the kinetic energy of the nuclei. This possibility
might account for the fact that the temperature of the system is too low. Figure 4.4 shows
that the energy discrepancy exists even at 300 K, where the system ought to be in its ground
state. As was mentioned earlier, this is because the electrons have finite kinetic energy at 300
K. There is experimental evidence that electron excitations suppress temperature
increases in shocked xenon [19], so the argument presented here is not unreasonable.
Experimentally measured temperatures do not exist for shocked PE, so it remains to be seen whether
eFF incorrectly predicts temperature.
5. Conclusions. In this manuscript we have simulated the material response of PE to
hydrostatic shock compression using the electron force field wavepacket molecular dynamics
method. We conclude that eFF performs well for high energy shock simulations, with simulated
Hugoniots reproducing experimental and theoretical findings. eFF predicts that above
2.4 g/cm$^3$ the polymer backbone will begin breaking apart and electrons will begin ionizing. For
300 GPa shocks significant structural deterioration and ionization will occur. Temperature
remains an issue for eFF, and determining the best measure of temperature for eFF will be
the focus of continuing work. The fidelity of the eFF Hugoniot indicates that van der Waals
interactions are not important under shock conditions; eFF should be able to model other
complex materials with Z < 6. We hope that the results presented in this paper will stimulate
further work on the applicability of eFF for problems in high energy-density physics.
FLUX-CORRECTED TRANSPORT ALGORITHM FOR THE REMAPPING STEP
OF A FEM ALE METHOD

A. LOPEZ ORTEGA* AND G. SCOVAZZI†

Abstract. Arbitrary Lagrangian-Eulerian (ALE) methods are widely used in continuum mechanics. The idea
behind them is to use the complex boundary definitions allowed by a mesh that follows particles, while retaining the
ability to track discontinuities, as on a fixed Eulerian grid. The most common implementation involves several
Lagrangian steps, where the mesh moves with the material particles and can be twisted and deformed, followed by
a remapping step when necessary, where the mesh is modified (without changes in connectivity) to obtain a more
regular one for the next Lagrangian steps. The remapping procedure should ensure conservation of mass,
momentum and energy for the whole system. We present an approach for the remapping algorithm based on
Flux-Corrected Transport (FCT) and applied to a Finite Element (FE) scheme that, in addition to global conservation,
preserves positivity. Our test results show monotonicity preservation (although it cannot be completely proven) and
high-order behavior where the solution is sufficiently smooth.


1. Introduction. Flux-Corrected Transport (FCT) algorithms are commonly used in
continuum mechanics [2]. In FCT, two numerical schemes are employed. The first one is
high-order but can produce overshoots/undershoots and oscillatory or unphysical solutions, most
commonly near discontinuities. The second one is a low-order scheme and more diffusive in
character. This second method is chosen such that it does not create new extrema and gives
meaningful results (e.g. the density cannot become negative). FCT attempts to correct the
low-order method by adding parts of the high-order one where possible, while preserving
monotonicity. This method was first developed by D. Book and J. Boris (see chapter 1 of [2])
and was mainly used in Finite Volume and Finite Difference schemes in compressible and
incompressible fluid dynamics.

In this article, we present an FCT-based algorithm for the remapping step of a FE-ALE
method. The application of FCT to Finite Elements is not trivial and was first explored by
L. Lohner, D. Kuzmin et al. (see e.g. [6], [4], [8], [7] and Chapters 6, 7 and 8 of [2]). The
former developed a consistent method for obtaining the low-order scheme from a Taylor-Galerkin
formulation of the problem and an algorithm to compute the flux correction based
on the high-order solution. The latter introduced a strategy based on obtaining a low-order
scheme from the pure Galerkin formulation and estimating the flux correction from the
low-order solution, normally by iteration. This second approach is closer to the one followed in
this article. They applied this technique to the Euler equations on a fixed FE grid with good
results. R. Liska et al. in [5] followed the FCT approach for the remapping step but applied it
to a Finite Difference ALE method.

This  article  is  structured  as  follows.  First,  we  describe  the  Galerkin  formulation  and 
present  the  system  of  equations  to  be  solved  for  remapping  the  density,  pointing  out  the 
properties  that  we  want  to  enforce  in  the  subsequent  steps.  The  second  step  is  obtaining  the 
low-order  scheme  by  adding  diffusion  to  the  Galerkin  formulation.  After  that,  we  describe 
the  algorithm  needed  to  compute  the  correction  fluxes.  This  idea  is  extended  to  a  system 
of  equations,  in  which  we  remap  the  density,  momentum  and  total  energy  of  a  system  while 
maintaining  positive  density  and  internal  energy.  Finally,  we  present  some  relevant  test  results 
that  highlight  the  accuracy  and  benefits  of  this  approach. 

2. Finite Elements for remapping step. A continuum can be described in both Lagrangian
and Eulerian coordinates. The Lagrangian coordinates X move with the material
particles. Eulerian coordinates x are fixed in space and can only track trajectories by
integration of the velocity field. The one-to-one Lagrangian-to-Eulerian map $x = x(X, t)$ with
$x(X, t_0) = X$ answers the question of where the particle that was originally at X is at time t in
spatial coordinates. The gradient of this map is the well-known deformation tensor, defined
as $F_{ij} = \partial x_i / \partial X_j$. The determinant of this tensor is the Jacobian J of the transformation.
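
To make the deformation tensor and its Jacobian concrete, the following sketch evaluates $F_{ij} = \partial x_i / \partial X_j$ for an arbitrary map by central differences. The function name, the step size, and the example map are hypothetical and chosen only for illustration.

```python
import numpy as np


def deformation_gradient(phi, X, h=1e-6):
    """Central-difference approximation of F_ij = d x_i / d X_j at point X,
    where phi maps Lagrangian coordinates X to Eulerian coordinates x."""
    X = np.asarray(X, dtype=float)
    n = X.size
    F = np.empty((n, n))
    for j in range(n):
        dX = np.zeros(n)
        dX[j] = h
        # column j holds the derivative of every x_i with respect to X_j
        F[:, j] = (phi(X + dX) - phi(X - dX)) / (2.0 * h)
    return F
```

For a map such as `phi = lambda X: np.array([2.0*X[0] + 0.5*X[1], 3.0*X[1]])`, `np.linalg.det(deformation_gradient(phi, [0.3, -0.2]))` approximates the Jacobian J, which relates volume elements through dx = J dX.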

We  use  the  Finite  Element  Method  (FEM)  for  the  Lagrangian  step,  which  advances  in 
time,  and  the  remapping  step,  which  is  held  at  constant  time.  The  aim  of  the  remapping  step 
is  to  accurately  calculate  the  value  of  the  conserved  variables  at  the  nodes  of  the  new  mesh, 
knowing  the  values  of  these  variables  on  the  old  mesh.  In  this  process,  we  must  conserve  the 
density,  momentum  and  energy  of  the  considered  system. 

We start with the simplest case, a remapping for a scalar variable (in this case, the density
$\rho$). The weak integral form of mass conservation in Lagrangian coordinates is the
following (see [10]):

$$\int_{\Omega_X} \phi(X) \left.\frac{\partial}{\partial t}\right|_X \big(J(X)\rho(X)\big)\, dX - \int_{\Omega_X} \nabla_X \phi(X) \cdot \big(\rho(X)\, w\big)\, dX = 0, \qquad (2.1)$$

where $\phi(X)$ is a test function that satisfies the boundary conditions of the problem, $\rho$ is the
density and $w = c\,J F^{-T}$, with c being the relative velocity between the Eulerian and
Lagrangian grids.

The remeshing is more easily described in terms of Eulerian variables, since it involves
a change in the spatial position of the nodes. The last expression can be transformed to an
Eulerian frame of reference using the map $x = x(X, t)$ and in particular $dx = J\,dX$. For the
first integral we get:

$$\int_{\Omega_X} \phi(X) \left.\frac{\partial}{\partial t}\right|_X \big(J(X)\rho(X)\big)\, dX = \frac{d}{dt} \int_{\Omega_X} \phi(X)\rho(X)J(X)\, dX = \frac{d}{dt} \int_{\Omega_x} \phi(x)\rho(x)\, dx, \qquad (2.2)$$


and  for  the  second: 


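$$\int_{\Omega_X} \nabla_X \phi(X) \cdot \big(\rho(X)\, w\big)\, dX = \int_{\Omega_x} \nabla_x \phi(x) \cdot \big(\rho(x)\, c\big)\, dx, \qquad (2.3)$$

where we have used $\frac{\partial \phi}{\partial x_i} = \frac{\partial \phi}{\partial X_j}\frac{\partial X_j}{\partial x_i} = (F^{-T})_{ij}\frac{\partial \phi}{\partial X_j}$.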

The discretization of $\rho$ is given by a generalized group FE Galerkin formulation [1],
where we express $\rho(x) = \rho_i \phi_i(x)$ and $\rho(x)c(x) = \rho_i c_i \phi_i(x)$ (summation over repeated
indices implied), where i represents the nodal indices of our mesh and $\phi_i(x)$ are now shape
functions of compact support (i.e. $\phi(x) = \phi_i(x)$ with $\phi_i(x_i) = 1$ and $\phi_i(x_j) = 0$ for all $j \neq i$).

The time discretization allows us to obtain a fully discrete scheme. We choose to use a
finite difference scheme for the temporal derivative of equation (2.2) and evaluate equation
(2.3) at the half time step ($\rho^{n+1/2} = (\rho^{n+1} + \rho^{n})/2$). This procedure yields:

$$v_{ij}^{n+1}\rho_j^{n+1} - v_{ij}^{n}\rho_j^{n} = k_{ij}^{n+1/2}\rho_j^{n+1/2}, \qquad (2.4)$$

with

$$v_{ij} = \int_{\Omega_x} \phi_i \phi_j\, dx, \qquad (2.5)$$

and

$$k_{ij}^{n+1/2} = -u_j \cdot \int_{\Omega_{x,n+1/2}} \phi_j^{n+1/2}\, \nabla \phi_i^{n+1/2}\, dx, \qquad (2.6)$$
where $u_j = -c_j \Delta t$ is the known displacement of the nodes from the old to the new mesh. The
step n defines the current mesh while step n+1 defines the new one. This way, we eliminate the
time dependency of the problem in favor of the remapping displacements. The negative sign
is included because the movement of the boundaries of the cell is opposed to the movement of
the flow. Throughout this article, we assume that $u_j$ is always known and given by a particular
remesh strategy.

2.1.  Relevant  properties  of  the  matrices  K  and  V.  Both  the  volume  V  and  flux  K 
matrices  have  important  properties  that  are  given  by  the  compact  support  shape  functions 
used  in  our  Finite  Element  scheme. 

We define a lumped volume matrix $V^L$, which is diagonal and whose terms are the sum
of the terms of the consistent matrix V over a row (i.e. $v_i^L = \sum_j v_{ij}$). This lumped matrix
conserves the mass of the system. Because the shape functions sum to one, it can be computed directly by:

$$v_i^L = \int_{\Omega_x} \phi_i\, dx.$$

The stiffness matrix K can be written as $k_{ij} = -u_j \cdot c_{ij}$, where $c_{ij} = \int_{\Omega_x} \phi_j \nabla \phi_i\, dx$. The
matrix C is skew-symmetric and $\sum_j c_{ij} = 0$. The sum over rows of the stiffness matrix
($\sum_j k_{ij} = \sum_j -u_j \cdot c_{ij}$) is the discrete Galerkin counterpart of $\int_{\Omega} \phi(x) \nabla_x \cdot u(x)\, dx$ (the
divergence) when only displacements that are tangential to the boundary are allowed, and accounts
for changes in volume. The diagonal entries of K are zero.

These  facts  are  used  in  the  following  sections  to  prove  the  properties  of  the  low-order 
scheme  and  are  necessary  in  applying  the  flux  correction  algorithm. 
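
These properties can be checked numerically on a small example. The sketch below assumes a one-dimensional periodic mesh with linear (hat) shape functions, for which the element contributions are known in closed form; the function names are hypothetical and the setting is far simpler than the general FE case treated in this article.

```python
import numpy as np


def assemble_1d_periodic(x, length):
    """Assemble the consistent volume matrix V and the matrix C on a periodic
    1D mesh of linear elements. x holds sorted node coordinates in [0, length);
    element e joins node e and node e+1 (modulo the number of nodes)."""
    n = len(x)
    V = np.zeros((n, n))
    C = np.zeros((n, n))
    for e in range(n):
        i, j = e, (e + 1) % n
        h = (x[j] - x[i]) % length  # element length, wrapping at the boundary
        # consistent volume (mass) matrix of a linear element
        V[i, i] += h / 3.0
        V[j, j] += h / 3.0
        V[i, j] += h / 6.0
        V[j, i] += h / 6.0
        # c_ab = integral of phi_b * d(phi_a)/dx over the element:
        # d(phi_i)/dx = -1/h, d(phi_j)/dx = +1/h, integral of phi_b = h/2
        C[i, i] -= 0.5
        C[i, j] -= 0.5
        C[j, i] += 0.5
        C[j, j] += 0.5
    return V, C


def flux_matrix(C, u):
    """k_ij = -u_j * c_ij for scalar nodal displacements u."""
    return -C * np.asarray(u)[None, :]
```

On such a mesh, `V.sum(axis=1)` reproduces the lumped volumes, `C` is skew-symmetric with zero diagonal and zero row sums, and the row sums of `flux_matrix(C, u)` equal the change in lumped volume when the nodes are displaced by `u`.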


3. Assembly of the low-order system. Our final aim is to solve equation (2.4). This is
a linear system, so it can be solved directly by:

$$\rho^{n+1} = \left[V^{n+1} - \frac{1}{2}K^{n+1/2}\right]^{-1}\left[V^{n} + \frac{1}{2}K^{n+1/2}\right]\rho^{n}. \qquad (3.1)$$

This process, however, is not computationally efficient for large systems and can give rise
to unphysical oscillations. These oscillations arise because this system is typically second order
in space and time and so cannot handle jumps or shocks without creating overshoots and
undershoots. This behavior can lead to unphysical results, such as negative densities.

The solution to this issue is based on the FCT algorithm. For this approach, we first need
to assemble a low-order scheme that has the correct properties. In the Finite Volume and Finite
Difference fields a wide variety of methods are available, depending on the stencils
used. That is not the case, however, for Finite Elements, where the stencil is defined by the
shape functions, which are typically taken to be as simple as possible. Requirements for the
low-order scheme are to be conservative, enforce positivity (the scheme cannot give unphysical
values), and smooth the extrema of the function while not creating new ones. In addition, it is
an FCT algorithm requirement that the low-order scheme follow the same time discretization
used for the high-order scheme.

A  proposed  solution  for  assembly  of  the  low-order  scheme  is  to  substitute  the  volume 
matrices  V  by  their  lumped  counterparts  VL  and  substitute  the  flux  matrix  K  by  another 
matrix  L  (see  chapter  6  of  [2]): 

vf’"+ip?+i  -  vf’np'J  =  /"/1/2pf1/2.  (3.2) 

The  substitution  of  the  consistent  volume  matrix  by  the  lumped  one  adds  some  diffusion  to 
the  system.  The  low-order  flux  matrix  L  is  obtained  from  K  by  adding  a  diffusion  matrix  D 
(i.e., L = K + D). The diffusion matrix is symmetric and satisfies $\sum_j d_{ij} = 0$ to enforce
conservation.

Next, we explore the requirements that the diffusion matrix must satisfy to preserve the
positivity  and  the  Local  Extremum  Diminishing  (LED)  constraints. 

3.1.  The  positivity  constraint.  We  can  express  the  low-order  scheme  as: 


$$A\rho^{n+1} = B\rho^{n}, \qquad (3.3)$$

with $A = V^{L,n+1} - \tfrac{1}{2}L^{n+1/2}$ and $B = V^{L,n} + \tfrac{1}{2}L^{n+1/2}$. We suppose that
$\rho^{n}$ is a vector of positive components and we require the same property for $\rho^{n+1}$.
Sufficient conditions for this are $b_{ij} \ge 0$, $a_{ij} \le 0$ for $j \ne i$, $a_{ii} > 0$ and
$a_{ii} > \sum_{j \ne i} |a_{ij}|$ (see chapter 6 of [2]).

The only non-diagonal contributions to the B matrix come from L. Then, all the non-diagonal
components of this matrix must be zero or positive. This can be obtained if D has positive
off-diagonal components. Negative diagonal components are then needed to preserve the null
sum over rows. The conditions $a_{ii} > 0$ and $a_{ii} > \sum_{j \ne i} |a_{ij}|$ are automatically
satisfied because the lumped volume matrix is a diagonal matrix with positive components and
the diagonal components of L are non-positive (the diagonal entries of K are zero and D has
non-positive diagonal terms).

The last constraint is that the diagonal terms of B must be positive. This is not
automatically satisfied and depends on the displacement field chosen for the problem. This is a
CFL-like  condition  and  appears  in  every  method  that  includes  terms  at  the  previous  step  n  in 
the  right-hand  side  of  Equation  (3.2). 

In conclusion, we need to generate a low-order stiffness matrix whose non-diagonal components
are non-negative. A simple algorithm for achieving this is the following: we initialize
$l_{ij} = k_{ij}$ and, in a loop over all non-diagonal components above (or below) the principal
diagonal, we take $d_{ij} = \max(-k_{ij}, -k_{ji}, 0)$ and we update:


(3.4) 


Note  that  the  diagonal  components  are  updated  each  time  one  non-diagonal  component  of  the 
same  row  or  column  is  modified.  This  algorithm  is  similar  to  the  one  used  by  Kuzmin  et  al.  in 
chapter  6  of  [2]. 
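
The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-entry update (add $d_{ij}$ to both off-diagonal entries, subtract it from both diagonal entries) is the standard discrete-upwinding update and is assumed here because the update formula itself is not legible in this copy; the function name `low_order_operator` and the 3-node matrix are hypothetical.

```python
import numpy as np

def low_order_operator(K):
    """Build L = K + D by eliminating negative off-diagonal entries of K.

    D is symmetric with zero row sums, so adding it does not change
    the row sums of K (assumed update, see lead-in).
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    D = np.zeros_like(K)
    for i in range(n):
        for j in range(i + 1, n):          # entries above the diagonal
            d = max(-K[i, j], -K[j, i], 0.0)
            D[i, j] += d                   # off-diagonal pair
            D[j, i] += d
            D[i, i] -= d                   # diagonal compensation keeps
            D[j, j] -= d                   # every row sum of D at zero
    return K + D, D

# Hypothetical 3-node flux matrix, for illustration only.
K = np.array([[ 0.0,  0.5,  0.0],
              [-0.5,  0.0,  0.5],
              [ 0.0, -0.5,  0.0]])
L, D = low_order_operator(K)
```

With this construction the off-diagonal entries of `L` are non-negative while the row sums of `K` are preserved, which is what the positivity argument above relies on.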

3.2. The Local Extremum Diminishing (LED) constraint. The low-order scheme cannot
create new local extrema. When the volume matrices at steps n + 1 and n are the same
(i.e., pure advection), this is achieved just with the conditions we introduced for the low-order
stiffness matrix L. To prove this, we start with a semi-discrete formulation:


(3.5) 


where we have used the fact that, for pure advection, the sum over a row of K is zero. It is
easy to see that $d\rho_i/dt \ge 0$ when $\rho_i$ is a local minimum and $d\rho_i/dt \le 0$ when it is a
local maximum, because all the non-diagonal components of the low-order stiffness matrix are
non-negative. We use shape functions with compact support; consequently, the non-diagonal
terms only relate neighboring nodes, so the smoothing happens for all local extrema.

The  remapping  scheme  however  deviates  from  this  structure.  In  this  case,  we  have  two 
different  volume  matrices  at  pseudo-time  n  +  1  and  n ,  and  a  non-zero  sum  over  rows  of  the 
stiffness  matrix.  At  a  fully  discrete  level,  we  can  write  for  our  low-order  scheme: 


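$$\rho_i^{n+1} - \rho_i^{n} = \frac{1}{v_i^{L,n+1}} \sum_{j \ne i} l_{ij}^{n+1/2}\left(\rho_j^{n+1/2} - \rho_i^{n+1/2}\right) - \frac{v_i^{L,n+1} - v_i^{L,n}}{v_i^{L,n+1}}\,\rho_i^{n} + \sum_j \frac{k_{ij}^{n+1/2}}{v_i^{L,n+1}}\,\rho_i^{n+1/2}. \qquad (3.6)$$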

At first sight, the two extra terms at the end of the expression could ruin the condition for
the LED constraint described in Equation (3.5). An analysis of them reveals that they both
account for changes in volume. The first one is just the difference between the initial and
final volumes divided by the final volume. The sum over the rows of $k_{ij}/v_i^{L,n+1}$ is a discrete
expression of the divergence of the remesh displacement field, which also defines the change
in volume.

We  now  introduce  a  new  desirable  property.  We  would  like  to  obtain  a  constant  remapped 
density when the input is a constant density, even using the high-order scheme. If we
substitute this requirement in Equation (2.4) and sum over rows, the following expression is
obtained: 


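$$0 = \rho_0 \left[ \int_{\Omega(t_{n+1})} \phi_i(x, t_{n+1})\,dx - \int_{\Omega(t_n)} \phi_i(x, t_n)\,dx - \sum_j \int_{\Omega(t_{n+1/2})} \phi_i \nabla_x \phi_j(x, t_{n+1/2}) \cdot u_j\,dx \right]. \qquad (3.7)$$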


We can achieve this constraint by computing the lumped volume not by quadrature of the
shape functions but from the previous expression, i.e. $v_i^{L,n+1} = v_i^{L,n} + \sum_j k_{ij}$. It can be
proven that, if the shape functions and the displacement field are linear in space and time
respectively, we need an integration formula in time with polynomial exactness equal to the
number of dimensions minus one, $n_d - 1$, to satisfy equation (3.8) (see Chapter 8 of [10]).
Therefore, for a 1D scheme with piecewise linear elements ($c_{i,i+1} = -1/2$ and $c_{i,i-1} = 1/2$),
we can demonstrate that this condition is satisfied just by constructing the matrices from
shape function quadrature, no matter what point in time we take to construct K (the flux
matrix is the same at every pseudo-time point between n and n + 1):

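$$v_i^{L,n+1} - v_i^{L,n} = \tfrac{1}{2}\left(x_{i+1}^{n+1} - x_{i-1}^{n+1} - x_{i+1}^{n} + x_{i-1}^{n}\right) = \tfrac{1}{2}\left(u_{i+1} - u_{i-1}\right) = k_{i,i+1} + k_{i,i-1}. \qquad (3.8)$$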

For  a  2D  scheme,  we  guarantee  that  the  matrices  constructed  by  quadrature  of  the  shape 
functions  satisfy  the  geometric  conservation  law  if  we  evaluate  K  at  time  n  +  1/2.  In  the 
case  of  the  3D  scheme,  we  need  two  quadrature  points  in  time.  The  code  that  was  developed 
takes  into  consideration  this  conservation  law  and  uses  the  quadrature  points  that  are  required 
to compute the remesh volume matrix. Then, the $K^{n+1/2}\rho^{n+1/2}$ right-hand side term can be
rearranged as a function of these two quadrature points.
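
The one-dimensional statement of the geometric conservation law can be checked numerically. The sketch below is illustrative only: it assumes piecewise-linear elements, so that the lumped volume of an interior node is half the distance between its two neighbors, and it uses a hypothetical smooth displacement with fixed end nodes; the helper name `lumped_volumes` is not from the original code.

```python
import numpy as np

def lumped_volumes(x):
    """Lumped (row-sum) nodal volumes for linear elements on a 1D mesh."""
    h = np.diff(x)                 # element lengths
    v = np.zeros_like(x)
    v[:-1] += 0.5 * h              # each element gives half its length
    v[1:] += 0.5 * h               # to each of its two nodes
    return v

# Hypothetical mesh motion over one pseudo-time step (end nodes fixed).
x_old = np.linspace(0.0, 1.0, 9)
x_new = x_old + 0.02 * np.sin(2.0 * np.pi * x_old)
u = x_new - x_old                  # nodal remesh displacement

dv = lumped_volumes(x_new) - lumped_volumes(x_old)
row_sum_interior = 0.5 * (u[2:] - u[:-2])   # (u_{i+1} - u_{i-1}) / 2
```

For interior nodes, `dv[1:-1]` matches `row_sum_interior` to round-off, independently of the pseudo-time at which the displacement is sampled, which is the 1D property stated in the text.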

Due  to  the  enforcement  of  a  constant  solution  by  the  geometric  conservation  law,  the 
LED  constraint  is  automatically  satisfied.  We  show  this  by  writing: 


and  taking  the  last  term  of  the  right-hand  side  to  the  left-hand  side: 


$$\rho_i^{n+1} - \rho_i^{n} = \sum_{j \ne i} \frac{2\,l_{ij}^{n+1/2}}{v_i^{L,n+1}(1 + \kappa_i)} \left(\rho_j^{n+1/2} - \rho_i^{n+1/2}\right). \qquad (3.10)$$

Note that this expression is equivalent to (3.5) but with a modified volume term
$(v_i^{L,n+1} + v_i^{L,n})/2$.
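
The algebra behind this step can be sketched as follows. The sketch assumes the low-order scheme (3.2), the midpoint value $\rho^{n+1/2} = (\rho^{n+1} + \rho^{n})/2$, the geometric conservation law in the form $v_i^{L,n+1} - v_i^{L,n} = \sum_j k_{ij}$, zero row sums of D, and the reading $\kappa_i = v_i^{L,n}/v_i^{L,n+1}$; the last item is an interpretation, since the definition of $\kappa_i$ is not legible in this copy.

```latex
\begin{aligned}
v_i^{L,n+1}\rho_i^{n+1} - v_i^{L,n}\rho_i^{n}
  &= \sum_j l_{ij}\,\rho_j^{n+1/2}
   = \sum_{j\neq i} l_{ij}\left(\rho_j^{n+1/2}-\rho_i^{n+1/2}\right)
     + \rho_i^{n+1/2}\sum_j k_{ij}, \\
v_i^{L,n+1}\rho_i^{n+1} - v_i^{L,n}\rho_i^{n}
  - \tfrac12\left(v_i^{L,n+1}-v_i^{L,n}\right)\left(\rho_i^{n+1}+\rho_i^{n}\right)
  &= \tfrac12\left(v_i^{L,n+1}+v_i^{L,n}\right)\left(\rho_i^{n+1}-\rho_i^{n}\right), \\
\rho_i^{n+1}-\rho_i^{n}
  &= \frac{2}{v_i^{L,n+1}+v_i^{L,n}}
     \sum_{j\neq i} l_{ij}\left(\rho_j^{n+1/2}-\rho_i^{n+1/2}\right).
\end{aligned}
```

Since $v_i^{L,n+1} + v_i^{L,n} = v_i^{L,n+1}(1 + \kappa_i)$ under the assumed definition, the last line has the same structure as the pure-advection case: non-negative coefficients multiplying differences with neighboring nodes.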

We must point out that there can be a small violation of monotonicity preservation when
computing the low-order solution using a predictor-multi-corrector algorithm. We propose a
possible solution strategy:


^’"+1(p"+1’(m+1)  -  vL’V") 

=  2  i:;m  (pnj+l/2’(m)  -  p;+l/2’(m) )  -  f  2  c]  ^  (p-  -  pr1,(M))  - 


7# 


(3.11) 


where  m  defines  the  number  of  iterates  of  the  predictor-multi-corrector  strategy.  One  can 
argue  that  the  last  term  of  the  right  hand  side  can  be  moved  to  the  left  hand  side  with  m  +  1 
instead  of  m.  This  will  violate  the  conservation  properties  however  since  we  would  be  using 
two  different  sets  of  temporal  values  to  evaluate  the  matrix  L.  Consequently,  the  scheme 
shown  above  is  the  only  iterative  one  that  preserves  conservation  at  every  step.  Rearranging 
terms  and  using  the  equivalency  between  the  volume  matrices  and  the  sum  over  rows  of  the 
flux  matrix,  we  obtain: 


yL,n+ 1  _|_  yL,n 


Hn+\,(m+ 1)  _  +  ^n+l,(m+l)  _  ^n+l,(m)\\ 


Zrn+ 1/2  /  n+l/2,(m) 

lu  {Pj  Pi 


n+l/2,(m)\ 


j±i 


(3.12) 


The second term of the left hand side introduces a possible perturbation in monotonicity
depending on the difference between iterations. In practice, this term tends to zero as the
iteration converges and can possibly be damped sufficiently by the excessive diffusion of the
low-order scheme. In our simulations, we have never noticed overshoots or undershoots due to
this term after running three predictor-multi-corrector iterations.

4. Antidiffusive flux computation. The low-order solution is in general overly diffusive,
while the high-order solution can create new extrema or be physically incorrect. The key
idea of FCT is to add a weighted amount of the high-order solution to the low-order
one. To achieve this, we first must compute an estimate of the difference between the two
schemes, called the raw antidiffusive fluxes. Suppose for the moment that the low-order
scheme could be computed using a time discretization different from that of the high-order
solution; this lets us show that the two must coincide in order to use the FCT algorithm.
Then,


p. = 2  (te+1/2  -  pi+l'2)  -  it!e  fa*  -  v ,+e )) + vtj  {p')-pni))+- 


j*i 


...  +  2  c2pr1/2  -  Z  ku^+e + vin+1v+l  -  viilpnj+1 . 


(4.1) 


where $\rho_i$ is the density solution of the high-order scheme and $\hat{\rho}_i$ is the solution of the
low-order one. If we add this flux $P_i$ to the low-order scheme, we recover the high-order one.
The solution we would like is an intermediate between the two: it must preserve positivity and
LED but it must be as little diffusive as possible. To achieve this, the antidiffusive fluxes
are applied after weighting them from 0 to 1 (the first value corresponding to the low-order
solution and the last to the high-order). Zalesak’s algorithm is widely used for this purpose
and will be described in the next section.

Zalesak’s algorithm, however, imposes some restrictions on the form of $P_i$. First, we need
to write $P_i = \sum_{j \ne i} f_{ij}$, with the property $f_{ij} = -f_{ji}$. This can only be achieved if we
use the same density (either the low-order or the high-order one) to estimate the flux. The
structure of the consistent and lumped volume matrices calls for this. In addition, we need
$\theta = 1/2$: this value is chosen to guarantee the skew-symmetry property of the fluxes that
cannot be achieved otherwise (we know that $K - L = -D$ and that
$d_{ij}(\rho_j - \rho_i) = -d_{ji}(\rho_i - \rho_j)$). With these assumptions, each individual flux reads:


fij(pn+\pn ) 

(pn+l  +pn 

-  _//«+1/2  n _ n_ 

v  {  2 


(4.2) 


5. An iterative algorithm for the remeshing step. An iterative algorithm, similar to
that proposed by Kuzmin and Möller in Chapter 6 of [2], can be implemented. This algorithm
computes  the  antidiffusive  flux  in  a  few  steps.  At  each  iteration,  we  obtain  a  new  positivity 
preserving  and  LED  solution  that  becomes  a  “new”  low-order  solution  for  which  we  can 
compute  new  antidiffusive  fluxes.  As  the  iteration  progresses,  the  amount  of  additional  flux 
needed  decreases. 

First,  consider  the  partial  problem: 


$$v_i^{L,n+1/2}\,\tilde{\rho}_i = v_i^{L,n}\rho_i^{n} + \tfrac{1}{2}\sum_j l_{ij}^{n+1/2}\rho_j^{n}, \qquad (5.1)$$

with $v_i^{L,n+1/2} = (v_i^{L,n+1} + v_i^{L,n})/2$. That is, we obtain an explicit low-order solution for
the density at the half-step ($\tilde{\rho} = \rho^{n+1/2}$). This solution can be updated by adding
antidiffusive fluxes. This is performed in several steps. At each of them, we solve:

$$v_i^{L,n+1/2}\,\tilde{\rho}_i^{(m+1)} = b_i^{(m)}, \qquad (5.2)$$

where  b(m)  is  defined  as: 

bf  =  bf~X)  + 

j 

(5.3) 

with: 

b !0)  =  vf’V"  +  f 

(5.4) 

and 

a  Am)  _  Am)  _  Jm) 

Jij  hj  °ij  ’ 

(m)  _  (m-1)  (m)Af(m) 

°ij  °ij  aij  Jij  ' 

(5.5) 

(5.6) 

This way, at each step of the iteration, we add the flux corresponding to the difference between
what is needed and what was previously added. The antidiffusive fluxes are computed using
Equation (4.3) with the last iterate solution as the estimate for the densities. The solution
of (5.2) remains positivity-preserving as long as the data at time step n is monotone and
the additional fluxes are calculated using Zalesak’s limiter. The $\Delta f_{ij}$ values shrink as
the number of iterations increases. Normally, a very low number of iterations is
needed to get acceptable results. Note that this algorithm remeshes a constant solution exactly
only if the volume matrix at the half step is used, because
$v_i^{L,n+1/2} = (v_i^{L,n+1} + v_i^{L,n})/2 = v_i^{L,n} + \tfrac{1}{2}\sum_j k_{ij}$ from Equation (3.8).

After  this,  we  update  the  solution  to  the  whole  step.  This  is  achieved  by  solving: 

This can be done iteratively by:


(5.8) 


where  the  first  guess  is  given  by  the  last  density  value  obtained  in  the  previous  loop.  We  note 
that  monotonicity  could  be  compromised  if  we  use  a  low  number  of  iterations  (see  Equation 
(3.12)).  A  typical  number  of  iterations  for  this  step  is  three. 

Finally,  we  can  add  a  bit  more  antidiffusion  to  the  solution  by  using  the  same  multistep 
algorithm  that  was  used  at  the  half  step: 


(5.9) 


where $\Delta f_{ij} = f_{ij} - g_{ij}$. The best results are obtained when $g_{ij}$ is not reset for this
iteration and is kept with the value it had after the half-point antidiffusion addition.

The number of iterations that need to be run to get accurate results varies with the problem.
A pure advection problem looks better as we add more iterations at the half-step or at
the end. This is not usually the case for the remesh (typically, the one-dimensional remesh
proposed in [5]). Normally, a strategy consisting of one half-step and one final antidiffusion
addition yields good results for straight shocks. For other tests, like a 1D triangle function
or a bump, it can produce terracing (the final function looks like a ladder). If the half-step
antidiffusion addition is eliminated, the results are more robust (i.e. no appreciable terracing
is observed), but somewhat more dissipative. We have observed that using more than one
iteration for any of the antidiffusive steps normally gives poor results for continuous
functions. The algorithm tries to convert, for example, a bump function into a square or terraced
function. Finally, if the final antidiffusive flux addition is not used, the solution looks less
sharp for straight discontinuities, producing a “shark fin” shape.

In conclusion, the two best strategies are (i) one half-step and one final antidiffusion
addition, which is less diffusive but can produce terracing in some cases, and (ii) a single
final addition of antidiffusion, which gives a very robust algorithm and is probably more
suitable for multivariable problems.

5.1. Zalesak’s algorithm. Zalesak’s algorithm [12] is used to obtain the weights $\alpha_{ij}$ in
a consistent way. Before formulating the algorithm, we can consider setting to zero all the
antidiffusive fluxes that satisfy $\Delta f_{ij}(\rho_j - \rho_i) < 0$. This is because, in this case, the
differential fluxes add diffusion instead of eliminating it. The aim of the algorithm is to
preserve the positivity and LED constraints by adding pieces of the high-order solution where
possible (i.e. smooth regions). To do this, we first split the total flux $P_i$ into positive and
negative fluxes:


(5.10) 


This  way,  we  estimate  the  possibility  of  overshooting  a  maximum/minimum  of  the  function 
at  the  node  i  only  by  considering  the  positive/negative  contributions,  respectively.  We  define 
a  new  quantity  Q  which  estimates  how  far  we  are  from  the  possible  extrema: 


(5.11) 

(5.12) 


To estimate the maximum and minimum values, we just have to look at the neighboring nodes
because the shape functions have compact support. For the iterative algorithm, we impose
monotonicity by estimating these values using the maximum and minimum of the previous
iterate and the solution at step n. Then, in order to prevent the formation of spurious
undershoots or overshoots, the antidiffusive flux $f_{ij}$ must be multiplied by:


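$$R_i^{\pm} = \begin{cases} \min\{1,\; v_i^{L,n+1} Q_i^{\pm} / P_i^{\pm}\} & \text{if } P_i^{\pm} \ne 0, \\ 1 & \text{if } P_i^{\pm} = 0. \end{cases} \qquad (5.13)$$

Finally, we must satisfy the balance of fluxes for global conservation (i.e., $\alpha_{ij} = \alpha_{ji}$).
This is achieved by:

$$\alpha_{ij} = \begin{cases} \min\{R_i^{+}, R_j^{-}\} & \text{if } \Delta f_{ij} \ge 0, \\ \min\{R_i^{-}, R_j^{+}\} & \text{if } \Delta f_{ij} < 0. \end{cases} \qquad (5.14)$$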


Taking  the  minimum  value  of  the  pair,  we  take  the  most  conservative  option  to  eliminate  any 
possible  overshoot/undershoot.  More  information  about  Zalesak’s  algorithm  can  be  found  in 
Chapter  2  of  [2]. 
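
The limiter can be sketched in Python as below. This is a minimal illustration of the standard nodal Zalesak limiter, not the code used for the results in this paper. It assumes the sign convention that `F[i, j]` is the raw antidiffusive flux into node i from node j (so `F` is skew-symmetric), takes the local bounds from the low-order solution over a node and its neighbors, and uses a generic lumped volume `m`; the function name `zalesak_weights` and the 5-node data are hypothetical.

```python
import numpy as np

def zalesak_weights(F, u, m, adj):
    """Nodal Zalesak limiter (illustrative sketch).

    F[i, j]  raw antidiffusive flux into node i from node j (F = -F.T)
    u        low-order solution used to define the local bounds
    m        lumped volumes (positive)
    adj      boolean adjacency matrix, diagonal included
    """
    n = len(u)
    P_plus = np.maximum(F, 0.0).sum(axis=1)    # sum of incoming fluxes
    P_minus = np.minimum(F, 0.0).sum(axis=1)   # sum of outgoing fluxes
    u_max = np.array([u[adj[i]].max() for i in range(n)])
    u_min = np.array([u[adj[i]].min() for i in range(n)])
    Q_plus = u_max - u                         # room before an overshoot
    Q_minus = u_min - u                        # room before an undershoot
    R_plus = np.ones(n)
    R_minus = np.ones(n)
    pos = P_plus > 0.0
    neg = P_minus < 0.0
    R_plus[pos] = np.minimum(1.0, m[pos] * Q_plus[pos] / P_plus[pos])
    R_minus[neg] = np.minimum(1.0, m[neg] * Q_minus[neg] / P_minus[neg])
    # A flux into i (F >= 0) is bounded by R+ at i and R- at the donor j.
    alpha = np.where(F >= 0.0,
                     np.minimum(R_plus[:, None], R_minus[None, :]),
                     np.minimum(R_minus[:, None], R_plus[None, :]))
    return alpha

# Hypothetical 1D chain of 5 nodes with a smeared step.
n = 5
adj = np.eye(n, dtype=bool)
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = True
u_low = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
m = np.ones(n)
F = np.zeros((n, n))
F[1, 2], F[2, 1] = -0.3, 0.3
F[2, 3], F[3, 2] = -0.3, 0.3
alpha = zalesak_weights(F, u_low, m, adj)
u_new = u_low + (alpha * F).sum(axis=1) / m
```

In this example the unlimited fluxes would push node 1 below 0 and node 3 above 1; the weights scale them back so that the corrected profile stays within the local bounds while the total of `u` is unchanged.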

6. Extension to a system of equations. The FCT iterative algorithm can be extended
to a multivariable system. The remapping step is equivalent to the advection of the conserved
quantities. The most common procedure is to remap $\rho$, $\rho v$ and $\rho E$ for the Euler equations,
where v is the velocity field and E the specific total energy ($E = e + |v|^2/2$, with e the
specific internal energy).

We  can  construct  the  high-order  scheme  the  same  way  we  did  for  the  density.  The  spatial 
discretization  of  the  conserved  variables  is  performed  using  the  same  shape  functions  for 
the  state  vector  and  for  the  convective  fluxes  (generalized  Galerkin).  This  way,  for  the  state 
vector,  we  have: 


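$$\rho(x) = \sum_i \rho_i \phi_i(x), \qquad (6.1)$$

$$\rho v(x) = \sum_i \rho_i v_i \phi_i(x), \qquad (6.2)$$

$$\rho E(x) = \sum_i \rho_i E_i \phi_i(x), \qquad (6.3)$$

and for the fluxes:

$$\rho(x) u(x) = \sum_i \rho_i u_i \phi_i(x), \qquad (6.4)$$

$$\rho v(x) \otimes u(x) = \sum_i \rho_i v_i \otimes u_i\, \phi_i(x), \qquad (6.5)$$

$$\rho E(x) u(x) = \sum_i \rho_i E_i u_i \phi_i(x). \qquad (6.6)$$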

The  construction  of  the  volume  and  stiffness  matrices  under  these  assumptions  is  trivial. 
They  have  the  same  structure  as  in  the  case  of  the  density: 

$$V^{n+1} y^{n+1} - V^{n} y^{n} = K^{n+1/2} y^{n+1/2}, \qquad (6.7)$$

where y is the vector of nodal quantities of any of the components of the state vector and V
and K are defined in Equations (2.5) and (2.6).

Therefore,  to  construct  a  low-order  method  from  this  one  we  follow  the  same  procedure 
that  was  used  for  the  density  case,  constructing  a  low-order  stiffness  matrix  L  from  the  high- 
order  counterpart  by  adding  a  symmetric  matrix  D  in  the  way  described  in  Equation  (3.4)  and 
substituting  the  consistent  volume  matrix  by  the  lumped  one. 

The system can be solved independently for each conserved variable. Another possible
strategy for directly obtaining the primitive variables is to solve first for the density, then
construct a new lumped mass matrix $m_i = v_i^{L}\rho_i$ and stiffness matrix $l_{ij}^{\rho} = l_{ij}\rho_j$, and
then solve directly for the velocity. The same strategy applies to the energy.

The  low-order  scheme  is  chosen  to  preserve  monotonicity  and  LED  constraints  for  the 
conserved  variables.  However,  this  is  not  always  true  for  the  primitive  variables;  for  example, 
in  a  jump,  if  the  density  is  smoothed  more  than  the  linear  momentum,  we  most  likely  obtain 
a  new  maximum/minimum  of  the  velocity.  It  appears  that  there  is  no  way  to  circumvent  this 
issue  and  we  need  to  trust  that  the  low-order  scheme  is  dissipative  enough  to  prevent  this 
potential  problem. 

6.1. Antidiffusive correction for a system. The construction of the low-order method
was just an extension of the algebraic case. The choice of a strategy to add antidiffusive
fluxes to the low-order solution is, however, not unique and is still a subject of research.
Here, we propose a strategy inspired by [3].

The  fluxes  can  be  computed  for  each  one  of  the  conserved  variables  in  the  same  way 
we  did  for  the  density.  Our  interest  is  not  to  prevent  the  conserved  variables  from  going 
beyond  certain  extrema,  but  to  do  this  with  the  primitive  variables  (for  example,  the  density, 
the  internal  energy,  or  the  pressure).  So,  we  must  estimate  a  primitive  variable  antidiffusive 
flux  and  weight  it  to  prevent  overshoots  and  undershoots. 

One possible way of doing this is to suppose that the complete solution of any variable is
just the low-order solution plus a small flux, such that the product of two fluxes,
conveniently non-dimensionalized, is much less than unity. This way $y_i = \hat{y}_i + f_{ij}^{y}$, where y
is any variable and the hat denotes the low-order solution. For the linear momentum flux we
get:

$$\widehat{(\rho v)}_i + f_{ij}^{\rho v} = (\hat{\rho}_i + f_{ij}^{\rho})(\hat{v}_i + f_{ij}^{v}), \qquad (6.8)$$

and doing some algebra and using that $\widehat{(\rho v)}_i = \hat{\rho}_i \hat{v}_i$:

$$f_{ij}^{v} = \frac{f_{ij}^{\rho v} - \hat{v}_i f_{ij}^{\rho}}{\hat{\rho}_i}. \qquad (6.9)$$

The same principle can be applied to the total energy flux to get the internal energy flux:

$$f_{ij}^{e} = \frac{f_{ij}^{\rho E} - \hat{v}_i \cdot f_{ij}^{\rho v} + \frac{\hat{v}_i^2}{2} f_{ij}^{\rho} - \hat{e}_i f_{ij}^{\rho}}{\hat{\rho}_i}. \qquad (6.10)$$
For the iterative algorithm described in section 5, the low-order solution is just the previous
estimate. Note that these new fluxes do not have the skew-symmetry property $f_{ij} = -f_{ji}$;
however, Zalesak’s algorithm can be generalized to be applicable to them. The first part of the
algorithm (Equations (5.10) through (5.14)) remains unchanged and is repeated
for each of the primitive variables.
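
The linearization can be checked numerically with a short sketch. It assumes one space dimension, the relation $E = e + v^2/2$, and corrections already divided by the nodal volume so that they act directly on the nodal values; the function name `primitive_fluxes` and the sample numbers are hypothetical.

```python
def primitive_fluxes(rho, v, e, f_rho, f_mom, f_E):
    """First-order velocity and internal-energy corrections obtained
    from the corrections of the conserved variables (1D sketch)."""
    f_v = (f_mom - v * f_rho) / rho
    f_e = (f_E - v * f_mom + 0.5 * v**2 * f_rho - e * f_rho) / rho
    return f_v, f_e

# Hypothetical low-order state and small conserved-variable corrections.
rho, v, e = 1.0, 2.0, 3.0
f_rho, f_mom, f_E = 1.0e-4, -2.0e-4, 3.0e-4
f_v, f_e = primitive_fluxes(rho, v, e, f_rho, f_mom, f_E)

# Exact primitive variables after applying the conserved corrections.
rho_new = rho + f_rho
v_new = (rho * v + f_mom) / rho_new
e_new = (rho * (e + 0.5 * v**2) + f_E) / rho_new - 0.5 * v_new**2
```

The differences `v_new - v` and `e_new - e` agree with `f_v` and `f_e` up to terms that are quadratic in the corrections, which is the small-flux assumption made in the text.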

Experience tells us that a synchronized algorithm, in which the same flux correction is
applied to all fluxes, is preferable because it avoids possible spurious oscillations.
Two possible choices for picking the final weight are taking
$\alpha_{ij} = \min(\alpha_{ij}^{\rho}, \alpha_{ij}^{v}, \alpha_{ij}^{e})$ or using a progressive
algorithm where the weights are computed for one of the primitive variables and
applied to the conserved fluxes. Then the flux for the second primitive variable is computed
using the new fluxes and new weights are obtained, repeating this with all the variables that
need to be limited.

Note that the computation of weights $\alpha$ for each primitive variable is not necessary. We
can  compute  the  weights  only  for  the  variables  that  we  want  to  preserve  between  bounds 
(e.g.  p  and  e)  and  compute  a  global  weight  using  the  minimum  of  these  two.  This  and  the 
linearization  performed  to  obtain  the  primitive  fluxes  can  give  rise  to  spurious  overshoots  and 
undershoots. An easy way of avoiding this is the addition of a failsafe algorithm [3]. The
failsafe algorithm detects when an overshoot or undershoot occurs and decreases the amount of
antidiffusive flux added to the particular node where this happens, and to its neighbors, in
order to preserve conservation. This can be done by subtracting all the antidiffusive
contributions at once to recover the low-order solution or by doing this in several steps. At
each step, we subtract a proportional part of the flux until the spurious oscillations are
cancelled.

7.  Test  Results.  In  this  section,  we  show  some  typical  test  cases  that  are  used  to  measure 
the  accuracy  of  the  method.  For  simplicity,  we  use  a  one-dimensional  version  of  the  code. 
First,  results  for  one  variable  are  shown  and  in  the  second  part  of  this  section  we  analyze  the 
multivariable  algorithm.  The  mesh  motion  is  based  on  the  one-dimensional  remesh  strategy 
described  in  [5]: 


$$x(\xi, t) = x_{min} + (x_{max} - x_{min})\,\hat{\xi}(\xi, t), \qquad (7.1)$$

$$\hat{\xi}(\xi, t) = (1 - \alpha(t))\,\xi + \alpha(t)\,\xi^3, \qquad (7.2)$$

$$\alpha(t) = \frac{\sin 4\pi t}{2}, \qquad (7.3)$$

for $0 \le \xi \le 1$ and $0 \le t \le 1$. The node positions are
$x_i^k = x(\xi_i, t_k) = x(i/N, k/k_{max})$, where N is the number of cells and $k_{max}$ the number
of pseudo-time steps. Typical tests are run for the pairs
$(N, k_{max})$ = (64, 320), (128, 640), (256, 1280). Note that the ratio between the two is
constant, leaving the CFL number unchanged. In Figure 7.1 the mesh motion is shown for
N = 24 and $k_{max}$ = 16.

Fig.  7.1.  Mesh  motion  for  N  =  24  and  kmax  —  16.  Under  this  motion,  each  cell  completes  two  sinusoidal  cycles 
of  expansion  and  contraction 
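
A small sketch of this mesh motion is given below for illustration. It assumes an amplitude of 1/2 in $\alpha(t)$, uniformly spaced reference coordinates $\xi_i = i/N$ and pseudo-times $t_k = k/k_{max}$, and a unit interval; these choices and the function names are assumptions made for the example rather than details confirmed by the text.

```python
import numpy as np

def alpha(t):
    """Amplitude of the cubic perturbation (two full cycles on [0, 1])."""
    return 0.5 * np.sin(4.0 * np.pi * t)

def node_positions(N, k, k_max, x_min=0.0, x_max=1.0):
    """Node positions at pseudo-time step k for a mesh of N cells."""
    xi = np.linspace(0.0, 1.0, N + 1)      # assumed uniform reference nodes
    a = alpha(k / k_max)
    xi_hat = (1.0 - a) * xi + a * xi**3    # smooth monotone remapping
    return x_min + (x_max - x_min) * xi_hat

# Mesh history for the case shown in the figure (N = 24, k_max = 16).
history = np.array([node_positions(24, k, 16) for k in range(17)])
```

Each row of `history` is one mesh; the end nodes stay fixed, the interior nodes move back and forth, and the mesh returns to the uniform one at the final step.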


7.1. Results for one variable. The typical tests that are run for testing an advection code
are the square, triangle and bump shapes. These tests are applied to the remesh in Figure 7.2.
As described at the end of section 5, the best results for the remapping are obtained by adding
antidiffusive fluxes either once at the half-step and once at the final step (called FCT2 in
the following figures) or only at the end (called FCT1). To show this, the
following figures compare the exact solution, the low-order solution and the solution obtained
with the FCT algorithm using the two different antidiffusion strategies for N = 128 and
kmax = 640. Table 7.1, along with the figures, shows that the FCT2 algorithm is better for
the square shape. However, the errors of FCT2 and FCT1 are comparable for the triangle and
the bump (FCT1 even performs slightly better for the bump). While the more dissipative
algorithm adapts better to the shape of the curve, the less dissipative one gives a better peak
value.


[Figure 7.2 panels: (a) Square shape; (b) Triangular shape]

Fig. 7.2. Remeshed solution for square, triangular and bump shapes. The method FCT2 produces a better
result for the square than FCT1, which is more dissipative since only one antidiffusion addition is performed.
Zalesak’s algorithm attempts to transform the triangular and bump shapes into a square shock or a succession of
them (terracing), especially when it is applied multiple times or the resolution increases. This way, FCT1 is
normally more robust although FCT2 offers better peak values.


Table 7.1
L1 error for N = 128, kmax = 640

Shape       Low Order   FCT1     FCT2
Square      0.6434      0.1292   0.0741
Triangle    0.3968      0.0368   0.0306
Bump        0.4894      0.0247   0.0249

The  most  demanding  test  found  in  the  literature  is  the  exponential  shock  ([5],  [9], [11]). 
The  density  profile  for  this  test  is  composed  of  an  exponential  increase  followed  by  a  shock 
and  represents  an  out-going  explosion.  This  test  is  run  using  FCT1  and  FCT2  in  Figure  7.3 
for  three  pairs  of  (N,  kmax)  in  order  to  analyze  convergence  of  the  method.  FCT2  gives  a 
higher  density  peak  but  shows  terracing.  FCT1  is  more  diffusive  but  behaves  better  for  the 
exponential part of the test. Table 7.2 shows that the errors in the final result are indeed
comparable for both methods.

[Figure 7.3 panels: (a) FCT1; (b) FCT2]

Fig. 7.3. Remeshed solution for the exponential jump. As predicted, FCT2 offers a better peak value, but
terracing is visible before the peak in the exponential part of the profile. FCT1 does not produce terracing.

Table 7.2
L1 error and convergence rate (C.R.) for the exponential shock problem with FCT1 and FCT2

                          FCT1                FCT2
Mesh                      Error    C.R.       Error    C.R.
kmax = 320,  N = 64       4.6840   -          4.6624   -
kmax = 640,  N = 128      3.1614   0.5672     2.8510   0.7096
kmax = 1280, N = 256      1.7572   0.8473     1.3873   1.0392
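
The tabulated convergence rates are consistent with the base-2 logarithm of the ratio of errors on successive meshes, since both N and kmax double between rows. The short check below illustrates this for the FCT1 column; the function name `convergence_rate` is only for illustration.

```python
import math

def convergence_rate(err_coarse, err_fine, refinement=2.0):
    """Observed order: log of the error ratio over log of the refinement."""
    return math.log(err_coarse / err_fine) / math.log(refinement)

# FCT1 errors for N = 64, 128, 256 taken from the table above.
errors_fct1 = [4.6840, 3.1614, 1.7572]
rates_fct1 = [convergence_rate(a, b)
              for a, b in zip(errors_fct1, errors_fct1[1:])]
# rates_fct1 reproduces 0.5672 and 0.8473 to the rounding of the table
```

The same function applied to the FCT2 column gives 0.7096 and 1.0392 to the same rounding.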

7.2.  Results  for  a  multivariable  system.  Two  one-dimensional  multivariable  tests  can 
be  found  in  [5]  and  [11].  The  first  one  is  a  simple  jump  for  density,  velocity  and  internal 
energy, while the second one is an extension of the exponential jump described for one
variable. The density jump is complemented by a linear function in velocity and internal energy
followed  by  a  shock. 

The  results  shown  are  obtained  by  limiting  the  density  and  internal  energy.  Velocity  can 
be  limited  too,  but  in  multidimensions  that  would  involve  the  computation  of  a  limit  for  each 
of  the  three  components  of  the  vector.  A  limiting  for  the  kinetic  energy  is  a  possible  substitute 
for  this.  The  failsafe  algorithm  is  used  to  avoid  possible  overshoots  and  undershoots. 

For  the  simple  jump,  the  limiting  of  density  and  internal  energy  is  enough  to  preserve  the 
monotonicity  of  all  the  variables  of  the  system.  Figure  7.4  shows  results  for  FCT2.  Table  7.3 
shows  how  the  FCT2  behaves  better  than  the  FCT1  for  this  type  of  problem,  since  straight 
jumps  do  not  normally  suffer  from  terracing. 


Table 7.3
L1 error and convergence rate (C.R.) for the simple jump problem with FCT1 and FCT2

                          FCT1                          FCT2
Mesh                      ρ        v        e           ρ        v        e
kmax = 320,  N = 64       0.0816   0.0356   0.0232      0.0483   0.0222   0.0149
C.R.                      0.7038   0.7075   0.6981      0.7763   0.7832   0.7762
kmax = 640,  N = 128      0.0501   0.0218   0.0143      0.0282   0.0129   0.0087
C.R.                      0.7019   0.7021   0.7004      0.7558   0.7444   0.7705
kmax = 1280, N = 256      0.0308   0.0134   0.0088      0.0167   0.0077   0.0051

[Figure 7.4 panels: Density; Velocity; Internal Energy; Linear Momentum]

Fig. 7.4. Remesh of a simple jump. Density, linear momentum and total energy are conserved. Internal energy
and density are limited to remain monotone. In this case, the method leads to monotonicity of all variables.

Figures 7.5 and 7.6 show results for the exponential shock system. FCT1 gives smooth
results for the exponential part of the curve, while FCT2 provides a better peak result but
some terracing. Table 7.4 reveals that FCT1 is close to the accuracy of FCT2 due to the
absence of terracing, while for the simple shock the error obtained with the former was almost
twice the error of the latter in the L1 norm.

8. Conclusions. We developed a Flux-Corrected Transport algorithm for the remeshing
step of a Finite Element ALE method using a multi-step algorithm for the antidiffusive flux
addition and solving the linear system with the aid of a predictor-multi-corrector algorithm.
This way, the algorithm is fast and typically only requires one or two steps of antidiffusion
addition and two or three steps of the multi-corrector to converge.

The  results  obtained  are  comparable  with  other  results  from  the  same  field  (i.e.,  [5] 
and  [11],  where  finite-differences  were  used).  Our  tests  reveal  that  the  ideal  number  of 
Fig. 7.5. Remesh of the exponential shock extended to velocity and internal energy with FCT1 (panels include
density, internal energy, linear momentum, and total energy). Results do not show terracing and extra-diffusion.
Constraining of multiple variables does not seem to influence the peak value for the density when compared to the
previous one-variable results.


antidiffusion-addition steps is between one and two. For straight jumps, having two or more
steps proves to be better, because Zalesak's algorithm tends to transform any possible shape
into straight shocks or a succession of them. For general problems with continuous functions,
the results using one and two steps are comparable, the first one being more capable of
correctly remeshing the original shape of the curve and the second giving more accurate values
close to the possible peaks.
Fig. 7.6. Remesh of the exponential shock extended to velocity and internal energy with FCT2 (panels include
density, velocity, internal energy, linear momentum, and total energy). Some terracing is visible, but the peak
values are closer to the original value than with FCT1.
MOLECULAR DYNAMICS SIMULATIONS OF SINGLE CONJUGATED
POLYMER NANOPARTICLE

SABINA MASKEY, FLINT PIERCE, DVORA PERAHIA, STEVEN J. PLIMPTON, AND GARY S. GREST

Abstract.  Molecular  dynamics  simulations  have  been  utilized  to  study  the  factors  that  hold  nanoparticles  formed 
by  conjugated  polymers  in  their  collapsed  conformations.  Here  we  present  simulations  in  which  dialkyl  poly  para 
phenyleneethynylene  (PPE),  an  optically  active  polymer,  is  used  to  make  a  single  conjugated  polymer  nanoparticle 
(CPN).  CPNs  have  potential  applications  in  fluorescence  imaging,  bio-sensors  and  optoelectronic  devices.  The  use 
of  these  nanoparticles  depends  on  their  stability  in  different  solvents.  We  found  that  these  polymer  nanoparticles  are 
stable  and  remain  collapsed  in  a  poor  solvent  but  unravel  in  a  good  solvent. 

1. Introduction. Conjugated polymers are organic semiconducting materials that
are electro-optically active, with applications from sensing to organic electronics [1, 3, 11,
13]. Packing these highly optically active polymers into nano dimensions enhances
their potential for sensing and as fluorescent probes. These polymers, whose natural
conformation is stretched, can be forced experimentally [12, 16] into collapsed spherical conjugated
polymer nanoparticles (CPNs). This raises questions regarding the factors that control their
stability, since this collapsed conformation is far from equilibrium. Nanoparticles have
properties different from those of bulk materials and molecules. Much of the interest in
nanoparticles arises from their large surface area and their small size, which is comparable to that
of many molecules, together with their unique electro-optical and magnetic properties. Their
electro-optical characteristics differ from those of the bulk due to quantum confinement on the
nanoscale.

Because of the immense technological potential of polymer dots, or CPNs, and the
far-from-equilibrium conformation of the polymer within these nanoparticles, we probe the factors that retain the
PPE molecules within their collapsed configuration. Molecular dynamics (MD) simulation
is used to investigate the molecular conformation of PPEs confined into nanoparticles and to
elucidate the interactions that result in the formation of stable nanoparticles. The current
study focuses on diethylhexyl para phenyleneethynylene (PPE) and compares the results to
the same chain without side chains. The repeat unit of the dialkyl PPE is shown in Figure 1.1.

The backbone of PPE consists of alternating single and triple bonds in conjunction with
aromatic rings, as shown in Figure 1.1. The single bonds along the backbone allow the
aromatic rings to rotate freely about the long axis of the molecule. When the aromatic rings are
confined to a single plane, the overlap of π orbitals results in delocalization of electrons along
extended segments of the backbone, and the backbone of the PPE becomes fully conjugated.
As a result, PPEs are intrinsic semiconductors and often emit light on excitation. Small angle
neutron scattering (SANS) has shown that dialkyl PPE in toluene, which is a good solvent for
the backbone, forms a complex phase diagram depending on the concentration and
temperature, including molecular solutions, micellar structures and fragile gels [7]. The conformation
of the PPE molecules in these three phases depends on their degree of constraint.

Conformational studies of single chains of various di-substituted alkyl PPEs have been
carried out using MD simulations in explicit and implicit solvents. These studies have
shown that these PPE molecules form rigid rod-like structures and do not form random coils
or collapsed structures for degrees of polymerization up to n = 2000 in a number of solvents [6].

Fig. 1.1. The chemical structure of poly para phenyleneethynylene (PPE). Here R represents ethylhexyl or H.


Recently, experimental studies have been performed on single conjugated polymer
nanoparticles (CPNs) to study their properties and stability. Their optical properties differ
significantly from those of free molecules, and they are also highly fluorescent. McNeill
and co-workers have demonstrated that several hydrophobic conjugated polymers form stable
nanoparticles when they are dissolved in a good solvent for the overall polymer, poured into a
poor solvent such as water that is miscible with the organic solvent, and sonicated
[12, 16]. After the organic solvent is removed, stable CPNs are dispersed in an aqueous
medium. Depending on the concentration and conformation of the polymers, the size and
properties of these CPNs can be tuned. The confinement of these unimolecular
nanoparticles resulted in a clear change in their photophysics. Nanoparticles consisting of conjugated
polymers have emerged as bright fluorescent markers. McNeill and co-workers investigated
optical properties and energy transfer phenomena of these CPNs [14]. They have shown that
these CPNs have potential applications in fluorescence imaging [17]. Other studies on CPNs
formed by PPE containing amines have shown their use for imaging of cells in a tissue [9].
These confined PPE CPNs remain in suspension and are optically active for months after their
preparation. These CPNs are gaining interest mainly because they are
easy to synthesize, can be easily tuned to the desired application by the choice
of conjugated polymer, and, above all, are biocompatible.

Using MD simulations, the current study aims to determine the conformation and
stability of the conjugated PPE polymer nanoparticles in both good and poor solvents.
Computationally, it is challenging to imitate the exact experimental procedures used to make
these CPNs. To mimic the confinement of these CPNs, we enclosed the PPE chain in
a large sphere. We then slowly decreased the size of the sphere, collapsing the polymer chain
as a function of time until the sphere was compressed to the maximum extent possible, after
which the collapsed polymer nanoparticle was allowed to relax in either a good or a poor
solvent.


2. Simulation Methodology and Results. The PPEs used in this work have been
modeled using the fully atomistic Optimized Potentials for Liquid Simulations - All Atom
(OPLS-AA) potential of Jorgensen et al. [4, 5]. The OPLS-AA potential comprises several
terms,

$$U_{\mathrm{OPLS\text{-}AA}} = U_{nb} + U_{bond} + U_{ang} + U_{tor} + U_{imp} \qquad (2.1)$$

Nonbonded interactions $U_{nb}$ are a sum of standard 12-6 Lennard-Jones (LJ) and electrostatic
potentials [4, 5],


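$$U_{nb}(r_{ij}) = U_{LJ} + U_{coul} = 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + k_{coul}\,\frac{q_i q_j}{r_{ij}} \qquad (2.2)$$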


where $\epsilon_{ij}$ is the LJ energy and $\sigma_{ij}$ is the LJ diameter for atoms i and j, and $q_i$ and $q_j$ are their
partial charges. For atoms of different species, geometric mixing rules are used, $\epsilon_{ij} = (\epsilon_i \epsilon_j)^{1/2}$
and $\sigma_{ij} = (\sigma_i \sigma_j)^{1/2}$. Nonbonded interactions are calculated between all atomic pairs on
different molecules in addition to all pairs on the same molecule separated by three or more
bonds. The interaction is reduced by a factor of 1/2 for atoms separated by three bonds. All
LJ interactions were cut off at 12 Å. All electrostatic interactions for atom pairs closer than
12 Å are calculated in real space, while those outside this range are calculated in reciprocal
(Fourier) space by the use of a standard particle-particle particle-mesh (PPPM) algorithm [2]
with a precision of $10^{-4}$. Direct covalent bonds between atoms are modeled in OPLS as
harmonic potentials of the form


$$U_{bond}(r_{ij}) = k_r (r_{ij} - r_0)^2 \qquad (2.3)$$

where $r_{ij}$ is the distance between atoms i and j and $r_0$ is their equilibrium separation.
Harmonic forms are also used to describe the angle bending potential of directly bonded chains
of three atoms (i, j, k),

$$U_{ang}(\theta_{ijk}) = k_\theta (\theta_{ijk} - \theta_0)^2 \qquad (2.4)$$

where $\theta_{ijk}$ is the angle between the vectors $r_{ji}$ and $r_{jk}$, and $\theta_0$ is its equilibrium value. The torsional
(dihedral) component of the OPLS potential is given by

$$U_{tor}(\phi) = \sum_{n=1}^{4} \frac{k_n}{2}\left[1 - (-1)^n \cos(n\phi)\right] \qquad (2.5)$$

where $\phi$ is the dihedral angle and the $k_n$ are energy coefficients [15]. In the OPLS-AA framework, the
out-of-plane (improper) potential has the same form as the torsional potential.
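The individual terms of Equations 2.2 to 2.5 can be sketched in a few lines of Python. This is only an illustration of the functional forms under stated assumptions: the function names are hypothetical, the Coulomb prefactor 332.06371 kcal Å/(mol e^2) assumes energies in kcal/mol and distances in Å, and no cutoffs, 1-4 scaling, or long-range (PPPM) treatment are included.

```python
import math

def u_nonbonded(r, eps_i, eps_j, sig_i, sig_j, q_i, q_j, k_coul=332.06371):
    """Eq. (2.2): 12-6 LJ plus Coulomb, with geometric mixing of eps and sigma.
    k_coul assumes kcal/mol, Angstrom and elementary-charge units."""
    eps = math.sqrt(eps_i * eps_j)
    sig = math.sqrt(sig_i * sig_j)
    sr6 = (sig / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6) + k_coul * q_i * q_j / r

def u_bond(r, k_r, r0):
    """Eq. (2.3): harmonic bond stretch."""
    return k_r * (r - r0) ** 2

def u_angle(theta, k_theta, theta0):
    """Eq. (2.4): harmonic angle bend (angles in radians)."""
    return k_theta * (theta - theta0) ** 2

def u_torsion(phi, k):
    """Eq. (2.5): k holds the coefficients (k_1, ..., k_4); phi in radians."""
    return sum(0.5 * k_n * (1.0 - (-1) ** n * math.cos(n * phi))
               for n, k_n in enumerate(k, start=1))
```

For two uncharged atoms of the same type, `u_nonbonded` evaluated at r = 2^(1/6) σ returns -ε, the depth of the LJ well, which is a quick sanity check on the mixing and the 12-6 form.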

The PPE chains were built using the Polymer Builder of Materials Studio from Accelrys
Inc. The samples were originally built using the polymer consistent force field (pcff), as the
OPLS-AA potential is not implemented in Materials Studio. An in-house conversion utility
was used to convert the Materials Studio data files into LAMMPS data files, and the force field
was changed from pcff to OPLS-AA. All simulations were performed using the LAMMPS
classical MD code [8]. The equations of motion were integrated using the velocity-Verlet
algorithm with a time step of $\Delta t = 1$ fs. PPE nanoparticles were formed by compressing
isolated PPE chains to final radii of 1 and 2 nm by enclosing them within a large
spherical cavity whose radius was slowly reduced over 1 ns in a poor solvent. The cavity wall
interacts with the PPE chains via a harmonic potential using the indent fix in the LAMMPS
MD code. The PPE chains were collapsed to form a nanoparticle of the desired radius in the
NVE ensemble with a Langevin thermostat [10] with a 100 fs damping time to regulate the
temperature at 300 K. After the PPE molecule was compressed to its final size, the enclosing
cavity was removed and the compressed PPE nanoparticle was allowed to relax. Because of
computational limitations, these simulations were run in implicit good and implicit poor solvents.
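The compression protocol described above can be illustrated with a short Python sketch. It is not the LAMMPS implementation: the linear radius ramp, the harmonic wall energy (1/2) k (r - R)^2 for atoms outside the cavity, and all function and parameter names are assumptions made for illustration.

```python
import math

def cavity_radius(t, r_start, r_final, t_ramp):
    """Cavity radius at time t: linear ramp (assumed) from r_start to r_final
    over t_ramp, held at r_final afterwards."""
    frac = min(max(t / t_ramp, 0.0), 1.0)
    return r_start + frac * (r_final - r_start)

def wall_force(pos, center, radius, k_wall):
    """Force on one atom from a harmonic spherical wall (assumed form).
    Atoms inside the cavity feel no force; atoms outside are pushed back
    toward the center with magnitude k_wall * (r - radius)."""
    d = [p - c for p, c in zip(pos, center)]
    r = math.sqrt(sum(x * x for x in d))
    if r <= radius:
        return (0.0, 0.0, 0.0)
    scale = -k_wall * (r - radius) / r
    return tuple(scale * x for x in d)
```

In a run of this kind, `cavity_radius` would be evaluated at every time step and `wall_force` added to the force-field forces before the velocity-Verlet update, so the chain is squeezed gradually rather than all at once.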
Fig. 2.1. Snapshots of diethylhexyl PPE for (a) t = 0, (b) t = 0.5, and (c) t = 1 ns. The final radius of the CPN is
2 nm. Here carbon is cyan and hydrogen is white.


Fig. 2.2. Snapshots of PPE without side chains for (a) t = 0, (b) t = 0.5, and (c) t = 1 ns. The final radius of the
CPN is 1 nm.


To model a solvent that is poor for both the backbone and the side chains, such as water, all
the interactions were truncated at 12 Å. To model a good solvent for both the backbone and the
side chains, all the Lennard-Jones interactions in Equation 2.2 were truncated at the potential
minimum ($r_c = 2^{1/6}\sigma_{ij}$), producing purely repulsive nonbonded forces.
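The effect of the two cutoffs on the LJ force can be checked with a small Python sketch. The function name and the numerical values of ε and σ below are illustrative assumptions, not parameters from the force field used in this work.

```python
def lj_force(r, eps, sig, r_cut):
    """Radial 12-6 LJ force, -dU/dr (positive = repulsive), set to zero
    at and beyond the cutoff r_cut."""
    if r >= r_cut:
        return 0.0
    sr6 = (sig / r) ** 6
    return 24.0 * eps * (2.0 * sr6 * sr6 - sr6) / r

# Illustrative values only (kcal/mol and Angstrom), not from the paper.
EPS, SIG = 0.07, 3.5
R_CUT_POOR = 12.0                    # poor solvent: attractive tail retained
R_CUT_GOOD = 2 ** (1.0 / 6.0) * SIG  # good solvent: cut at the potential minimum

# Sample separations from 3.0 to 6.0 Angstrom.
separations = [3.0 + 0.1 * i for i in range(31)]
forces_good = [lj_force(r, EPS, SIG, R_CUT_GOOD) for r in separations]
forces_poor = [lj_force(r, EPS, SIG, R_CUT_POOR) for r in separations]
```

With the cutoff at the minimum, every sampled force is zero or repulsive, whereas with the 12 Å cutoff the force becomes attractive beyond 2^(1/6) σ. That attraction is what holds the collapsed particle together in the implicit poor solvent.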

Snapshots of nanoparticles as they are formed are shown in Figure 2.1 for diethylhexyl
PPE and in Figure 2.2 for the PPE molecule without side chains (R = H). For comparison,
Figures 2.1a and 2.2a show the equilibrated PPE chain of length n = 240 repeat units before
it is enclosed in the spherical cavity. As the radius of the cavity decreases, the ends of the
chains start to coil up gradually with time, as shown in Figures 2.1b and 2.2b, finally forming
collapsed PPE nanoparticles. The final size of the CPN depends on the side chains. The
smallest radii to which the CPNs could be compressed were 1 nm for PPE without side
chains and 2 nm for diethylhexyl PPE.

Once the PPE polymers were compressed, the indenter was removed, and the CPNs were
allowed to relax in a solvent. Figures 2.3 and 2.4 show our results for PPE CPNs with and
without side chains in a poor and a good solvent. Visually, we can see strong differences. In
the poor solvent, the CPNs initially expand somewhat but remain compact over the course
of the simulation. However, in the good solvent, these CPNs rapidly uncoil and continue to
unravel over the course of the run.


Fig. 2.3. Snapshots of diethylhexyl PPE CPNs in (a) poor solvent at 50 ns and (b) good solvent at 100 ns.


Fig. 2.4. Snapshots of PPE CPNs without side chains in (a) poor solvent at 50 ns and (b) good solvent at 100 ns.


To quantify our results, we measured the radius of gyration Rg of the CPNs as a function of time
in different solvents. Figure 2.5 shows Rg as a function of time for PPE
CPNs with and without side chains in poor and good solvents. The radius of gyration of
the two CPNs in the poor solvent increases slightly at first and then remains stable for the rest
of the run. Hence, these CPNs are stable in poor solvents, in agreement with experiment.
In the good solvent, however, Rg continues to increase over the course of the simulation as
the chain unravels. This uncoiling is faster for the CPN without side chains than for the one
with side chains, as shown by the more rapid increase in Rg in Figure 2.6.
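For reference, the radius of gyration of a single snapshot can be computed as in the following Python sketch. It assumes the mass-weighted definition about the center of mass, which the text does not state explicitly; the function name is hypothetical.

```python
import math

def radius_of_gyration(positions, masses):
    """Mass-weighted radius of gyration (assumed definition):
    Rg^2 = sum_i m_i |r_i - r_com|^2 / sum_i m_i."""
    total_mass = sum(masses)
    # Center of mass, one Cartesian component at a time.
    com = [sum(m * p[k] for m, p in zip(masses, positions)) / total_mass
           for k in range(3)]
    rg_squared = sum(
        m * sum((p[k] - com[k]) ** 2 for k in range(3))
        for m, p in zip(masses, positions)
    ) / total_mass
    return math.sqrt(rg_squared)
```

Applying this function to each saved frame of a trajectory gives an Rg(t) curve of the kind plotted in Figures 2.5 and 2.6.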

3. Conclusions. Our molecular dynamics (MD) simulations have shown that collapsed
PPE nanoparticles show different stability patterns depending on the solvent and the side chains. In
a poor solvent, these polymers retain their collapsed conformations, as shown by the
calculation of Rg. This stability pattern is consistent with the experimental measurements of
Rg by SANS. However, in the good solvent the CPNs unravel rapidly and continue to uncoil
during the course of the simulation. The size of the collapsed nanoparticles depends on the
type of side chains. For the first time, this study has provided an insight into the factors that
control the stability of PPE polymers as nanoparticles. It seems that van der Waals interactions
are sufficient to hold these CPNs together. To further our work, the behavior of these CPNs
in explicit solvent models such as water and toluene will be studied to tune the interaction
between the solvent and the polymers. Continuing with the study, we will investigate the
conformations and stability of conjugated polymer nanoparticles using different lengths and
types of side chains, sizes, and molecular weights in both implicit and explicit solvents as
well as a function of temperature.

Fig. 2.5. Radius of gyration of PPE CPNs as a function of time for (a) diethylhexyl PPE and (b) PPE
without side chains in poor (circles) and good (squares) solvents.

Fig. 2.6. Radius of gyration of PPE CPNs with (triangles) and without (diamonds) side chains as a function of
time in good solvent.
CHARGE TRAPS AND LOCAL ATOMIC RELAXATIONS IN AMORPHOUS
SILICON DIOXIDE

NATHAN L. ANDERSON, PETER A. SCHULTZ, AND ALEJANDRO STRACHAN

Abstract. We apply density functional techniques that properly account for the electrostatics in the presence
of periodic boundary conditions to report the defect levels of various charge transitions for a number of different
defects in amorphous silicon dioxide. In addition, for each transition level, we provide a physical explanation of
the atomic bonding rearrangements that occur upon electron or hole capture. We find that paramagnetic $E'_\gamma$ and
$E'_\beta$ defects exist in the neutral charge state and are capable of trapping both electrons and holes. Statistical support
for the oxygen-vacancy-originated dimerized model of the positively charged $E'_\delta$ defect is demonstrated. We show
that hole trapping by an undercoordinated silicon produces a significant number of stable, positively charged
overcoordinated oxygen defects, which are likely a prevalent source of trapped charge. Finally, evidence for the
existence of an ill-established hole trap associated with an overcoordinated silicon floating bond defect is presented.


1. Introduction. Point defects in amorphous silicon dioxide (a-SiO2) are known to play
a principal role in trapping charges. Because a-SiO2 is used in a wide range of technological
applications, these trapped charges are responsible for the failure of an assortment of
systems, from electronic devices to communication technologies. Most prominently observed
are threshold voltage shifts that prevent metal-oxide-semiconductor transistors from performing
their switching operations, and attenuation in optical fibers that limits the distance
of signal propagation. Understanding the physical nature of these charge traps at
an atomic scale can provide insight toward charge-trap-specific processing techniques targeted
at annealing the corresponding defects.

In  addition,  predicting  defect  levels  through  atomistic  scale  modeling  techniques  yields 
parameters  that  can  be  used  in  larger  scale  device  level  models,  improving  the  accuracy  of 
predicting  long-term  device  performance.  This  multiscale  approach  can  be  extended  to  other 
material  systems  eventually  culminating  in  the  ability  to  predict  new  materials  and  better 
performing  devices.  Here,  we  use  quantum  mechanical  methods  to  predict  the  energy  levels 
and provide a physical explanation for defect states that lie within the a-SiO2 band gap.

2. Method. A total of 27 stoichiometric and 42 oxygen-deficient (single oxygen atom
removed) a-SiO2 structures, containing 192 and 191 atoms respectively, were obtained
from a previous study [3]. Each of these structures contains a single isolated defect or
defect pair, as described in the following section. Density functional theory (DFT) [6] calculations within
the generalized gradient approximation (GGA) [12] are performed on each structure for charge
states spanning (2-) to (2+) with the code SeqQuest [13]. Each charge state is computed via the local
moment counter charge (LMCC) method [14], which eliminates the divergence created by the
Coulomb potential when subject to periodic boundary conditions (PBCs) [15]. An electron
reservoir (chemical potential) is set by putting the charged cell in contact with the neutral
cell for each defect. The charge is located at each defect site to investigate the differences in
the resulting electron densities after solving the Kohn-Sham (KS) self-consistent equations
[9], resulting in a total of 675 DFT calculations. In addition, the polarization energy that
each charge adds to the region outside the supercell is accounted for through a
model proposed by Jost [7] and refined by Schultz [16]. Polarization energies are calculated
in both the static and infinite-frequency dielectric constant limits. This method yields the
total energies and relaxed geometries of five different charge states at each defect site, making
it possible to calculate charge transition levels and correlate each level to a unique atomic
relaxation.

3. Neutral Defects in a-SiO2. Here we review the structural properties of the neutral
a-SiO2 point defects that are the subject of this investigation. Defects in amorphous materials must
be defined with reference to a defect-free system. Defect-free a-SiO2 is defined as a
continuous random network of SiO4 tetrahedra connected via doubly coordinated bridging oxygen
atoms. Hence, point defects in a-SiO2 are prominently associated with a coordination
deviation from the ideal four-fold Si and two-fold O. The defects studied here involve
undercoordinated silicon (III-Si), overcoordinated silicon (V-Si), overcoordinated oxygen (III-O),
and two-fold coordinated silicon (II-Si).

The neutral oxygen vacancy (NOV) is a defect that can essentially be considered a Si-Si
wrong bond within the amorphous network,

$$\equiv\mathrm{Si}-\mathrm{Si}\equiv \qquad (3.1)$$

where “$\equiv$” denotes three network Si-O bonds. The dissociated three-coordinated silicon
(D-III-Si) is a defect pair consisting of two isolated paramagnetic III-Si atoms,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} \qquad (3.2)$$

with “$\bullet$” representing an unpaired electron, or a Si dangling bond. III-O/III-Si is a defect
pair containing an isolated positively charged diamagnetic III-O and an isolated negatively
charged diamagnetic III-Si,

$${}^{+}\!\equiv\mathrm{O} + \equiv\mathrm{Si}{:}^{-} \qquad (3.3)$$

where “:” denotes paired electrons and the positive charge, corresponding to a hole delocalized
over three Si-O bonds, is made explicit in the notation. Observed in systems with perfect
stoichiometry, III-Si/V-Si is a defect pair that includes a positively charged diamagnetic III-Si
and a negatively charged V-Si,

$$\equiv\mathrm{Si}^{+} + \equiv\mathrm{Si}{=}^{-} \qquad (3.4)$$

with “=” adopted for two Si-O bonds. Also seen in stoichiometric systems is a pair
including an isolated positive diamagnetic III-O and an isolated negative V-Si,

$${}^{+}\!\equiv\mathrm{O} + \equiv\mathrm{Si}{=}^{-} \qquad (3.5)$$

given the nomenclature III-O/V-Si, again with the location of the positive charge made explicit
in the notation. The final defect investigated here is an isolated II-Si,


$${=}\mathrm{Si}^{\bullet\bullet} \qquad (3.6)$$

where the two “$\bullet$” symbols on the Si denote unpaired electrons in separate sp3 orbitals.

4. Charge Transition Levels. Self-consistently solving the KS equations yields the
KS eigenvalues. Hence, in systems containing a forbidden electronic occupation region, DFT
does not calculate the fundamental band gap but rather a KS gap, which
typically underestimates what is observed experimentally. This is the so-called “band gap
problem” associated with DFT. Schultz [16] has demonstrated, however, that by calculating
the energy levels for a variety of defects in multiple charge states one can obtain a charge
transition spectrum that spans the experimental band gap. The charge transitions are simply
Fig. 4.1. Defect levels of charge transitions in a-SiO2.


ionization potentials and electron affinities, that is, the difference between the polarization-corrected
total energies of the two charge states. Each transition level is calculated for every defect site,
and the resulting spectrum in the static dielectric limit is shown in Fig. 4.1. Although this
does not span the experimental a-SiO2 band gap of ~9 eV, it does provide a rich amount of
qualitative information about the physical nature of each defect level.
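As a sketch of the bookkeeping involved, the transition levels between adjacent charge states can be obtained from the polarization-corrected total energies of one defect site as follows. The function name, the sign convention E(q) - E(q+1), and the omission of the electron-reservoir reference are assumptions made for illustration, and the energies in the usage example are placeholders rather than computed results.

```python
def transition_levels(corrected_energy):
    """Return {(q, q + 1): E(q) - E(q + 1)} for adjacent charge states.

    corrected_energy maps an integer charge state q (e.g. -2 ... +2) to the
    polarization-corrected total energy of the relaxed structure in that state.
    """
    return {
        (q, q + 1): corrected_energy[q] - corrected_energy[q + 1]
        for q in sorted(corrected_energy)
        if q + 1 in corrected_energy
    }

# Placeholder energies (eV) for a single hypothetical defect site.
example_site = {-2: -10.0, -1: -12.5, 0: -14.0, 1: -13.0, 2: -11.0}
example_levels = transition_levels(example_site)
```

Five charge states per site give four transition levels, corresponding to (2-/1-), (1-/0), (0/1+) and (1+/2+).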

To  determine  the  energy  level  of  a  defect  within  the  silica  band  gap  we  must  first  define 
the  valence  band  (VB)  and  conduction  band  (CB)  edges.  Our  approach  to  this  definition 
involves  a  combination  of  the  KS  eigenvalue  occupations  and  local  geometric  relaxations. 

For every defect site the KS eigenvalues are plotted for each charge state as a function
of electron occupation near the KS gap; we will refer to this as a KS-spectrum. To facilitate
comparative analysis, each charge state KS-spectrum is aligned at the oxygen 2s level of the
neutral calculation, which lies deep in the valence band, preventing any edge distortions from
biasing the alignment. One example of a KS-spectrum for each defect type is shown in Fig.
4.2(a), and all 135 KS-spectra used in our analysis are available in the Supplementary Material
[2].

The  local  geometric  structure  of  each  defect  site  is  then  examined  as  a  function  of  each 
charge  state.  An  example  of  a  geometry  relaxation  is  shown  in  Fig.  4.2(b).  Within  the 
CB  the  accommodation  of  more  electrons  can  be  delocalized  throughout  the  structure  and 
virtually  no  local  geometric  relaxation  will  occur.  Likewise,  at  the  VB  edge  the  removal  of 
an  electron  (addition  of  a  hole)  can  result  in  the  partial  occupation  of  several  KS  eigenstates 
due  to  the  near  degeneracy  of  their  energies  and  thus  the  hole  will  also  be  delocalized,  again 
resulting  in  no  geometric  relaxation.  If  the  defect  state  lies  within  the  gap  however,  the 
charge  will  be  localized  at  the  defect  site  which  accommodates  electrons  (or  holes),  typically 
resulting  in  bonding  rearrangement.  This  generally  leads  to  shifts  in  the  energy  and  also  the 
appearance/disappearance  of  KS  eigenvalues  when  comparing  the  different  charge  states,  as 
Fig. 4.2. (a) Sample KS-spectrum used for a D-III-Si defect and (b) corresponding geometry relaxation with
each charge state labeled by appropriate color (arrows point to the III-Si involved).


each  charge  state  corresponds  to  a  different  structure. 

A parallel analysis of the KS-spectrum and the local geometric structure of each defect
site is performed to facilitate a prediction of whether or not the defect state is contained
within the band gap. Each defect charge transition is associated with a characteristic geometry
relaxation. From these, a geometry-relaxation-specific charge transition spectrum (analogous
to Fig. 4.1) is generated, shown in Fig. 4.3. The CB edge is determined by taking the minimum
energy of either the (1-/0) or (2-/1-) transition levels that corresponded to sites in which
the KS eigenvalue that had zero, single, and double occupation for the respective (0), (1-),
and (2-) states had the same energy and no geometric relaxation occurred. The VB edge is
taken as the maximum energy of either the (0/1+) or (1+/2+) transitions that corresponded
to sites that led to partial occupations of the KS eigenstates for the (1+) and (2+) charge
states, respectively, and that had no local geometric relaxation.
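The edge-selection rule just described amounts to a filter followed by a minimum or maximum. The Python sketch below assumes each transition has already been classified by hand: the record layout and the `delocalized` flag (standing for "KS occupation criterion met and no local geometric relaxation") are hypothetical, and the energies in the example are placeholders, not results from this study.

```python
CB_TRANSITIONS = {"(1-/0)", "(2-/1-)"}
VB_TRANSITIONS = {"(0/1+)", "(1+/2+)"}

def band_edges(records):
    """records: iterable of (transition_label, energy_eV, delocalized).

    Returns (vb_edge, cb_edge): the maximum delocalized hole-capture level and
    the minimum delocalized electron-capture level.
    """
    cb = [e for label, e, deloc in records if deloc and label in CB_TRANSITIONS]
    vb = [e for label, e, deloc in records if deloc and label in VB_TRANSITIONS]
    return max(vb), min(cb)

# Placeholder data: three delocalized band-like levels and two localized gap levels.
example_records = [
    ("(0/1+)", 0.4, True),
    ("(1+/2+)", 0.1, True),
    ("(0/1+)", 2.3, False),   # localized: excluded from the VB edge
    ("(1-/0)", 7.9, True),
    ("(2-/1-)", 6.5, False),  # localized: excluded from the CB edge
]
example_vb, example_cb = band_edges(example_records)
```

Levels flagged as localized are deliberately ignored when locating the edges. They are the gap states whose positions are then reported relative to those edges.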

5. Geometric Relaxations. We now provide a physical interpretation of the
characteristic geometry relaxations that occurred for the different charge transitions investigated. The
percentage of occurrences of each relaxation, for each defect and transition, is shown as
pie charts in Figs. 5.1-5.6. All 675 of the final geometrically relaxed structures and 135
images of the charge state comparisons are available in the Supplementary Material [2]. The
notation we adopt for each relaxation is DEFECT(Transition).number; e.g., the third
relaxation that occurs for the (0/1+) charge transition in a system containing an NOV would
be NOV(0/1+).III. All transitions that are theorized as VB or CB states are denoted *.o, the
lowercase zero used as the number because no localized geometry changes occurred. For
conciseness, within each defect subsection the defect name is omitted.

NOV. The removal of a single electron results in two different geometry relaxations.
Almost ubiquitously, an increase of the average Si-Si bond length of ~0.15 Å occurs [(0/1+).I],
suggesting that a hole is captured by the bond and the remaining electron delocalizes over
the dimerized Si sp3 hybrid orbitals,

$$\equiv\mathrm{Si}-\mathrm{Si}\equiv \; + \; h^{+} \;\rightarrow\; \equiv\mathrm{Si}^{\bullet} \;\; {}^{+}\mathrm{Si}\equiv \qquad (5.1)$$

lending credence to an assumption of a remaining weaker bond. Secondly, (0/1+).II
relaxations occur in systems in which the missing oxygen is essentially part of an edge-sharing
tetrahedron (EST), a two-member ring, and the neutral Si-Si bond distance is ~2.2 Å. We
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CHirje  Transit  kan  Lnels  [lK/191-atiK] 


Fig.  4.3.  Defect  levels  for  each  charge  transition  sorted  by  specific  corresponding  geometric  relaxation. 


shall  refer  to  the  systems  containing  this  defect  as  EST-NOV  and  it  should  be  regarded  as 
separate  from  the  NOV  as  characteristic  relaxations  undergone  are  exclusive  to  this  defect. 
The  (0/1+).//  involves  the  weakening  of  the  Si-Si  bond  through  a  distance  increase  of  ~0.4 
A  and  the  increase  of  the  Si-O-Si  bong  angle  by  -20°,  thus  this  transition  would  also  follow 
5.1. 

Absence of two electrons led to six unique geometry relaxations. (1+/2+).o involves
no localized geometry change and partially occupied KS eigenvalues, again suggesting VB
states. In the (1+/2+).i relaxation, both NOV silicon atoms pucker through the plane of
their nearest neighbor oxygen atoms and back-bond to another network oxygen,

$$\equiv\mathrm{Si}{-}\mathrm{Si}\equiv + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} \quad \mathrm{O}{\equiv}^{+} \qquad (5.2)$$

This results in two positively charged III-O centers. In (1+/2+).ii, only one of the
silicons becomes puckered while the second silicon becomes planar with its oxygen neighbors,

$$\equiv\mathrm{Si}{-}\mathrm{Si}\equiv + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} \quad {}^{+}\mathrm{Si}\equiv \qquad (5.3)$$

The three planar O atoms form O-Si-O bond angles of ~120°, suggesting that the Si atom
involved becomes sp2 hybridized and a hole is trapped on the remaining Si 3p orbital, which
lies perpendicular to the O-containing plane. (1+/2+).iii involves both silicon atoms
trapping a hole and becoming planar,

$$\equiv\mathrm{Si}{-}\mathrm{Si}\equiv + 2h^{+} \rightarrow \equiv\mathrm{Si}^{+} \quad {}^{+}\mathrm{Si}\equiv \qquad (5.4)$$

The (1+/2+).iv relaxation is a dramatic local network rearrangement in which two nearby
oxygen atoms become overcoordinated and form a two-member ring containing the two Si from
the NOV and the two new III-O. This relaxation would take the same form as Eq. (5.2).
Finally, in the EST-NOV systems, (1+/2+).v entails a further increase of the Si-O-Si bond
angle to ~130°, a value within the experimentally observed range of Si-O-Si bond angles in
a-SiO2, and the silicon atoms moving into the plane of their three neighboring oxygen atoms,
fully breaking the Si-Si bond. This leaves two positively charged III-Si that share a
bridging oxygen atom and would be chemically equivalent to Eq. (5.4).

Fig. 5.1. Geometric relaxation occurrence percentages for the NOV.
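
The hole-capture reactions (5.1)-(5.4) can be sanity-checked for charge conservation with a few lines of Python. This is only a bookkeeping sketch: the species labels and the formal charges assigned to them below are assumptions chosen for the example, read off from how the centers are described in the text, and are not output of the DFT calculations.

```python
# Formal charges of the species as described in the text.
# The string keys are hypothetical labels chosen for this sketch.
FORMAL_CHARGE = {
    "Si-Si dimer": 0,   # neutral oxygen vacancy (NOV) dimer
    "h+": +1,           # hole
    "III-O(+)": +1,     # positively charged overcoordinated oxygen
    "III-Si(+)": +1,    # planar III-Si with a trapped hole
    "III-Si(.)": 0,     # III-Si with one unpaired electron
}


def net_charge(side):
    """Sum formal charges over (species, count) pairs."""
    return sum(count * FORMAL_CHARGE[species] for species, count in side)


def is_charge_balanced(reactants, products):
    """True when both sides of a reaction carry the same net charge."""
    return net_charge(reactants) == net_charge(products)


# Reactions (5.1)-(5.4) written as (reactants, products).
REACTIONS = {
    "5.1": ([("Si-Si dimer", 1), ("h+", 1)], [("III-Si(.)", 1), ("III-Si(+)", 1)]),
    "5.2": ([("Si-Si dimer", 1), ("h+", 2)], [("III-O(+)", 2)]),
    "5.3": ([("Si-Si dimer", 1), ("h+", 2)], [("III-O(+)", 1), ("III-Si(+)", 1)]),
    "5.4": ([("Si-Si dimer", 1), ("h+", 2)], [("III-Si(+)", 2)]),
}

balanced = {name: is_charge_balanced(*sides) for name, sides in REACTIONS.items()}
```

Each entry of `balanced` is expected to be `True`; a reaction written with the wrong number of holes or product centers would show up as `False`.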

Addition of a single electron led to only two relaxations. Most prominently, (1-/0).o
consists of no localized geometry change and no shift in the energy of the
occupation-affected eigenvalue of the aligned KS spectrum, indicating a delocalized CB
state. In (1-/0).i, seen only in the EST-NOV systems, the Si-O-Si angle shifts to 85° and
the Si-Si bond distance increases to 2.4 Å, indicating a weakening of, but no change in,
the chemical bonding.

Three relaxations were observed upon adding two electrons. Frequently, (2-/1-).o occurs,
with no local geometry change or KS eigenvalue energy shift. Very interestingly, (2-/1-).i
involves a non-nearest neighbor bridging oxygen atom in the network breaking one of its Si
bonds and becoming a singly coordinated non-bridging oxygen (NBO), leaving behind a triply
coordinated silicon,

$$\equiv\mathrm{Si}{-}\mathrm{Si}\equiv + 2e^{-} \rightarrow \equiv\mathrm{Si}{-}\mathrm{Si}\equiv + {-}\mathrm{O}{:}^{-} + \equiv\mathrm{Si}{:}^{-} \qquad (5.5)$$

This new III-Si has an average bond angle of ~104°, suggesting that it has paired electrons
localized on its non-bonding sp3 orbital and that the NBO takes on the second electron to
complete the filling of its singly occupied 2p orbital. Lastly, again only seen in EST-NOV
systems, (2-/1-).ii is a rupture of the Si-Si bond as the atoms further separate by 0.5 Å
and the Si-O-Si angle moves to 113°. Accompanying (2-/1-).ii, the retreat from the initial
oxygen vacancy site results in one of the silicons forming a new Si-Si bond with its second
nearest neighbor Si, forming a new EST-NOV with a Si-Si bond length of 2.1 Å and a Si-O-Si
angle of 75°.

D-III-Si. In the neutral state the KS spectrum is characterized by two partially occupied
mid-gap states, presumed to be the two silicon dangling bonds. All of the charge transitions
associated with the D-III-Si are theorized to lie within the band gap. Three different
geometry relaxations are observed for the removal of a single electron. Most commonly, in
(0/1+).i, one of the III-Si moves into the plane of its three oxygen neighbors upon the
removal of one electron,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} + h^{+} \rightarrow \equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{+} \qquad (5.6)$$

Here, as with NOV(1+/2+).ii and NOV(1+/2+).iii, the O-Si-O bond angles around the newly
unpuckered III-Si are all ~120°, again implying that the bonds become sp2 hybridized, leaving
a hole localized on a 3p orbital perpendicular to the Si≡O3 plane. (0/1+).ii is the partial
movement of each of the two III-Si toward the O-containing plane. A network rearrangement
that allows an oxygen to move closer to one of the III-Si comprises (0/1+).iii.
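
The interpretations above rest on the average O-Si-O bond angle around an undercoordinated silicon. As a minimal sketch of how such an angle could be computed from Cartesian coordinates and mapped onto a qualitative label, consider the Python below. The function names and the two cutoff angles are illustrative assumptions, not values or code used in this work; the cutoffs merely separate a nearly planar (~120°) geometry from a roughly tetrahedral one and from a strongly contracted one.

```python
import math
from itertools import combinations


def angle_deg(center, a, b):
    """Angle a-center-b in degrees for three Cartesian points."""
    va = [x - c for x, c in zip(a, center)]
    vb = [x - c for x, c in zip(b, center)]
    dot = sum(p * q for p, q in zip(va, vb))
    norm_a = math.sqrt(sum(p * p for p in va))
    norm_b = math.sqrt(sum(q * q for q in vb))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


def mean_o_si_o_angle(si, oxygens):
    """Average O-Si-O angle over all pairs of oxygen neighbors."""
    angles = [angle_deg(si, a, b) for a, b in combinations(oxygens, 2)]
    return sum(angles) / len(angles)


def classify(mean_angle, planar_cutoff=115.0, contracted_cutoff=100.0):
    """Qualitative label; both cutoffs are assumptions for illustration only."""
    if mean_angle >= planar_cutoff:
        return "planar, sp2-like"
    if mean_angle >= contracted_cutoff:
        return "pyramidal, sp3-like"
    return "contracted, weakly hybridized"


# Idealized test geometries (arbitrary length units).
planar = [(1.0, 0.0, 0.0),
          (-0.5, math.sqrt(3) / 2, 0.0),
          (-0.5, -math.sqrt(3) / 2, 0.0)]
tetrahedral = [(1.0, 1.0, 1.0), (1.0, -1.0, -1.0), (-1.0, 1.0, -1.0)]
origin = (0.0, 0.0, 0.0)
```

With these idealized geometries, `mean_o_si_o_angle(origin, planar)` is 120° and `mean_o_si_o_angle(origin, tetrahedral)` is the tetrahedral angle of about 109.5°. In a periodic simulation cell the neighbor vectors would first need minimum-image wrapping, which this sketch omits.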

Upon the removal of the second electron one of three geometry relaxations occurs.
Primarily, in (1+/2+).i, the III-Si not involved in the (0/1+).i relaxation replicates the
behavior of the first III-Si, resulting in two isolated 3p-orbital trapped holes,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} + 2h^{+} \rightarrow \equiv\mathrm{Si}^{+} + \equiv\mathrm{Si}^{+} \qquad (5.7)$$

A second relaxation, (1+/2+).ii, occurs when both partially planar III-Si involved in
(0/1+).ii become planar with ~120° O-Si-O bond angles, again leaving two isolated III-Si with
3p-orbital trapped holes; it is chemically equivalent to Eq. (5.7). A third relaxation,
(1+/2+).iii, involves the III-Si nearest the oxygen involved in (0/1+).iii becoming
tetrahedrally coordinated, forming a positively charged III-O center,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}^{+} \qquad (5.8)$$

In (1+/2+).iii, the III-Si that does not undergo a coordination change also becomes planar
with its three oxygen neighbors in a fashion analogous to (0/1+).i, resulting in a 3p
localized hole.

Fig. 5.2. Geometric relaxation occurrence percentages for the D-III-Si.

Adding an electron to a neutral system yielded two relaxations which occur with similar
frequencies. In (1-/0).i, the average O-Si-O bond angle on both III-Si decreases from ~107°
to ~105°, which suggests that the sp3 hybridization slightly weakens and the extra electron
is delocalized over both dangling bonds. In (1-/0).ii, just one of the III-Si atoms tightens
its average O-Si-O bond angle from ~107° to ~103°, implying that the sp3 hybridization
weakens,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} + e^{-} \rightarrow \equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}{:}^{-} \qquad (5.9)$$

leading to a doubly occupied orbital; the extra negative charge density results in Coulomb
repulsion of the electrons in the Si≡O3 bonds.

Upon adding a second electron, two relaxations were found to take place, again with
similar frequencies. In (2-/1-).i, both of the III-Si involved in (1-/0).i tighten their
O-Si-O angles to ~102°, and both become diamagnetic with paired electrons localized on their
non-bonding sp3 orbitals,

$$\equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}^{\bullet} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{:}^{-} \qquad (5.10)$$

In (2-/1-).ii, the III-Si not involved in (1-/0).ii undergoes a consonant relaxation, leaving
a final geometry which is indistinguishable from that which follows (2-/1-).i.

III-O/III-Si. The neutral KS spectrum for this defect is characterized by a doubly occupied
mid-gap state. The III-Si involved in this defect pair had an average O-Si-O bond angle
of ~103°, which leads us to hypothesize that this III-Si has a doubly occupied sp3 orbital
where the Coulomb repulsion bends the other Si-O bonds away. This extra electron likely
came from the III-O center, which loses an electron to become positively charged. Thus, the
III-Si is able to stabilize the III-O by taking on an additional electron, and the pair of
electrons localized on the dangling bond of this III-Si corresponds to the observed neutral
mid-gap state.

One characteristic relaxation occurs upon the removal of a single electron. The III-Si
average O-Si-O bond angle increases to ~109° [(0/+).i], which implies a neutral paramagnetic
dangling bond consisting of a single unpaired electron,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}^{\bullet} \qquad (5.11)$$

This is further supported by the (1+) KS spectrum, which contains a singly occupied mid-gap
state.

Fig. 5.3. Geometric relaxation occurrence percentages for the III-O/III-Si.

Removal of a second electron leads to two different relaxations. In (+/2+).i, the III-Si
moves into the plane of its O neighbors with an O-Si-O angle of ~120°. As mentioned above,
we propose this is an sp2 hybridized III-Si with a 3p trapped hole,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}^{+} \qquad (5.12)$$

A second relaxation, (+/2+).ii, involves a local network rearrangement in which an oxygen in
the vicinity of the III-Si overcoordinates with the III-Si, resulting in a second positively
charged III-O center,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + {}^{+}{\equiv}\mathrm{O} \qquad (5.13)$$

Adding an electron results in two relaxations. The first, (1-/0).o, involves no change in
local geometry or in the energy of the occupation-affected KS eigenvalue. The second,
(1-/0).i, involves the breaking of one of the Si-O bonds involved in the III-O, leaving a
second III-Si with a ~106° average O-Si-O bond angle, pointing to an unpaired electron on an
sp3 orbital,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + e^{-} \rightarrow \equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}{:}^{-} \qquad (5.14)$$

A second electron addition leads to seven different relaxations. The first, (2-/1-).o, is a
relaxation identical to (1-/0).o. Secondly, the III-Si involved in (1-/0).i contracts its
average O-Si-O bond angle to ~99° [(2-/1-).i]. This relaxation, just as seen in other
relaxations, implies that the sp3 hybridization is very small and the two electrons occupy a
filled approximate 3s orbital with the Si-O bonds formed on the three 3p orbitals,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{:}^{-} \qquad (5.15)$$

A third relaxation, (2-/1-).ii, is analogous to (1-/0).i. Another relaxation, (2-/1-).iii,
involves the breaking of a bond between an O and a properly coordinated Si atom, and the
formation of a new bond between that O and a different fully coordinated network Si. The
resulting atomic geometry then contains two III-Si, one V-Si, and one III-O,

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{:}^{-} + {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{=}^{-} \qquad (5.16)$$

For both III-Si the average O-Si-O bond angle is ~103°; thus we presume paired electrons to
be localized on an sp3 orbital of each. We also predict that the III-O has given up
an electron which localizes near the V-Si, stabilizing this defect pair and resulting in the
net system charge of (2-). In the (2-/1-).iv relaxation, the III-Si involved in (1-/0).i
becomes roughly planar with an average O-Si-O bond angle of ~116° (it becomes sp2 hybridized
and traps a hole on its 3p orbital),

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}^{+} + {-}\mathrm{O}{:}^{-} \qquad (5.17)$$

This relaxation also involves a nearby fully coordinated silicon breaking a Si-O bond and
becoming undercoordinated with an average O-Si-O bond angle of ~95°, which suggests that it
is not hybridized, has a filled 3s orbital, and forms its three O bonds with its singly
occupied 3p orbitals. In addition to these unique rearrangements, (2-/1-).iv also involves a
nearby bridging oxygen breaking one of its bonds, leaving a negatively charged NBO with a
pair of electrons localized on its 2p orbital. Yet another relaxation, (2-/1-).v, involves a
tetrahedrally bonded silicon near the III-O breaking one of its Si-O bonds and becoming
planar with its three remaining oxygen neighbors with an average O-Si-O bond angle of ~120°
(again an sp2 III-Si with a 3p trapped hole),

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}^{+} + {-}\mathrm{O}{:}^{-} + {=}\mathrm{Si}{:}^{2-} + 2({-}\mathrm{O}^{\bullet}) \qquad (5.18)$$

The O involved in the broken Si-O bond becomes a singly coordinated NBO which has
trapped an electron. Another bonding change which occurs in (2-/1-).v involves another
nearby fully coordinated silicon with an average Si-O bond distance of 1.7 Å; three of its
O-Si-O bond angles are ~99°, and its fourth bond angle is an obtuse ~130°. The relaxation
that occurs involves widening the obtuse angle to ~165° and stretching the Si-O bonds which
are a part of that angle to ~1.9 Å. The other two bonds also stretch to ~1.8 Å but their
O-Si-O angles undergo negligible changes. We interpret this as a II-Si with very little
hybridization containing a filled 3s orbital and weak 3p bonds with the two nearer oxygens;
it is here that the (-2) charge is localized. The remaining two oxygen atoms can each be
thought of as an NBO with an unpaired electron; this is the well-known and experimentally
characterized electrically neutral paramagnetic non-bridging oxygen hole center (NBOHC) [5].
Finally, (2-/1-).vi is a relaxation synonymous with (2-/1-).v except that the positively
charged sp2 III-Si and negatively charged NBO relaxations do not occur (i.e., the doubly
negative II-Si with two nearby NBOHCs is the only relaxation which occurs),

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{:}^{-} + 2e^{-} \rightarrow {=}\mathrm{Si}{:}^{2-} + 2({-}\mathrm{O}^{\bullet}) \qquad (5.19)$$

III-Si/V-Si. Analysis of the neutral charge state of this defect showed that all the III-Si
were in a planar arrangement with their three neighboring oxygens, with ~120° O-Si-O bond
angles. As mentioned above, this is the same positively charged diamagnetic III-Si
defect that is observed in several oxygen deficient samples after the removal of an electron.
It is thought to be sp2 hybridized with a 3p localized hole; hence the electron is localized
somewhere else in the structure. We predict that the electron localization takes place near
the V-Si, allowing for the extra Si-O bond; thus the positively charged III-Si stabilizes
the V-Si.

Upon the removal of an electron two relaxations occurred: either no geometry change
and many partially occupied KS eigenvalues [(0/1+).o], or movement of one or multiple V-Si
nearest neighbor oxygen atoms [(0/1+).i] (supporting the hypothesis of the missing III-Si
electron localizing at the V-Si). Removal of another electron results in no geometry change
and many partial KS eigenvalue occupations, likely a VB transition.

Adding an electron results in the planar III-Si moving through the O plane and becoming
roughly tetrahedral with an average O-Si-O bond angle of ~107° [(1-/0).i], implying sp3
hybridization with an unpaired electron localized on the non-bonding orbital,

$$\equiv\mathrm{Si}^{+} + \equiv\mathrm{Si}{=}^{-} + e^{-} \rightarrow \equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}{=}^{-} \qquad (5.20)$$

Addition of a second electron leads to an O-Si-O angle of ~102° [(2-/1-).i], suggesting
paired electrons on the non-bonding hybrid orbital with Coulomb repulsion bending the other
three bonds away,

$$\equiv\mathrm{Si}^{+} + \equiv\mathrm{Si}{=}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{=}^{-} \qquad (5.21)$$

The fact that only one characteristic relaxation occurs upon the single and double addition
of electrons can be attributed to the relatively tight charge transition bands seen in
Figs. 4.1 and 4.3.

Fig. 5.4. Geometric relaxation occurrence percentages for the III-Si/V-Si.

III-O/V-Si. In a manner similar to the positively charged planar III-Si stabilizing the
V-Si via electron donation in the III-Si/V-Si, we hypothesize that the III-O has given up an
electron which again stabilizes the V-Si. The fact that relaxations analogous to those seen
in III-Si/V-Si occur for the removal of one and two electrons supports this hypothesis.
Single electron removal leads either to no geometry change and partial KS eigenvalue
population [(0/1+).o] or to movement of at least one of the V-Si oxygen neighbors
[(0/1+).i]. Double electron removal leads again to no geometry change and a considerable
number of partial KS eigenvalue occupations [(1+/2+).o].

Addition of an electron either causes no change in geometry or in the occupation-affected
KS eigenvalue [(1-/0).o], or breaks one of the III-O bonds, causing a Si to retreat and
become undercoordinated [(1-/0).i],

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{=}^{-} + e^{-} \rightarrow \equiv\mathrm{Si}^{\bullet} + \equiv\mathrm{Si}{=}^{-} \qquad (5.22)$$

This new III-Si is nearly tetrahedral with an average O-Si-O bond angle of ~107°, again sp3
hybridized with an unpaired electron on the non-bonding orbital.

A second electron induces four relaxations. The first is no structural change or
occupation-affected KS eigenvalue energy shift [(2-/1-).o]. For systems which underwent
(1-/0).o, a relaxation parallel to (1-/0).i occurs [(2-/1-).i], indicative of negative-U
behavior. As for the systems that were involved in (1-/0).i, the average O-Si-O angle for
the affected III-Si further decreases to ~102° [(2-/1-).ii],

$$ {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}{=}^{-} + 2e^{-} \rightarrow \equiv\mathrm{Si}{:}^{-} + \equiv\mathrm{Si}{=}^{-} \qquad (5.23)$$

as the second electron pairs on the singly occupied orbital and electrostatics repel the
remaining three bonds. Finally, (2-/1-).iii consists of the same re-bonding as (2-/1-).i;
however, it is accompanied by a significant rearrangement of the local network around the
new III-Si.

It is important to note here that the final geometries after the occurrence of the
III-Si/V-Si(1-/0).i and III-O/V-Si(1-/0).i are virtually indistinguishable from one another:
a single unpaired electron on a III-Si sp3 orbital and a V-Si. Likewise the final geometries
from the III-Si/V-Si(2-/1-).i and III-O/V-Si(2-/1-).ii are also equivalent: a doubly
occupied III-Si sp3 orbital and a V-Si. This is consistent with the observation that removal
of an unpaired sp3 electron from a III-Si in the oxygen deficient cells yielded two results
with very high frequency: a planar sp2 III-Si with a 3p trapped hole (identical to the
III-Si in the III-Si/V-Si systems) or a positively charged III-O center (akin to the III-O
in the III-O/V-Si systems).

Fig. 5.5. Geometric relaxation occurrence percentages for the III-O/V-Si.

II-Si. A fully occupied mid-gap state exists for the neutral species, likely two electrons
that are each localized on separate sp3 orbitals. This conclusion is drawn from the average
O-Si-O bond angle of 110°. Upon the removal of one electron the local network undergoes
a rearrangement [(0/1+).i] in such a manner that the II-Si and a neighboring oxygen become
coordinated,

$$ {=}\mathrm{Si}^{\bullet\bullet} + h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + \equiv\mathrm{Si}^{\bullet} \qquad (5.24)$$

leading to a III-Si and a positively charged III-O. Removal of a second electron furthers
the network rearrangement [(1+/2+).i], forming a new bond between the III-Si and another
network oxygen, resulting in a fully coordinated silicon and two positively charged III-O
centers,

$$ {=}\mathrm{Si}^{\bullet\bullet} + 2h^{+} \rightarrow {}^{+}{\equiv}\mathrm{O} + {}^{+}{\equiv}\mathrm{O} \qquad (5.25)$$

Addition of electrons leads to two relaxations for each transition. For the single electron
addition, (1-/0).o has no local geometry change and no occupation-affected KS eigenvalue
energy shift, and is assumed to be a delocalized CB state. In the other relaxation,
(1-/0).i, the addition of a single electron leads to an O-Si-O bond angle of 105°, implying
that the electron pairs with one of the sp3 electrons and the Coulomb repulsion bends the
Si=O2 bonds away from the doubly occupied orbital,

$$ {=}\mathrm{Si}^{\bullet\bullet} + e^{-} \rightarrow {=}\mathrm{Si}{:}^{-} \qquad (5.26)$$

Double electron addition again either has no effect on the geometry and the
occupation-affected KS eigenvalue [(2-/1-).o], or the second electron causes the bond angle
to shift dramatically to 96° [(2-/1-).i], which suggests that the Si hybridization
disappears and the electron configuration consists of a filled 3s orbital and singly
occupied 3p orbitals,

$$ {=}\mathrm{Si}^{\bullet\bullet} + 2e^{-} \rightarrow {=}\mathrm{Si}{:}^{2-} \qquad (5.27)$$

In (2-/1-).i, Si=O2 bonds are formed with two of these 3p orbitals and paired electrons are
left on the remaining orbital, with Coulomb repulsion bending the Si=O2 bonds perpendicular
to the idealized O-Si-O plane, causing the angle to be slightly larger than 90°.

6. Discussion. In this section we separately discuss the implications of the statistically
prevalent relaxations observed, as they apply to each isolated defect.

III-Si. Perhaps the most significant result seen here involves the III-Si. This defect is
seen to occur isolated and in three different charge states: a neutral paramagnetic state in
which an unpaired electron is localized on an sp3 hybrid orbital (≡Si•), a diamagnetic (+1)
state in which the III-Si is sp2 hybridized and planar with its three O neighbors with a
hole localized on its remaining 3p orbital (≡Si⁺), or a diamagnetic (-1) state in which the
non-bonding sp3 orbital is doubly occupied by a pair of electrons (≡Si:⁻). Thus it is the
neutral state which is EPR susceptible. These neutral isolated ≡Si• fragments can be viewed
as the commonly proposed NOV-derived E'γ center [19] in the absence of the puckered III-Si.
Some authors believe that these fragments correspond to the E'β center [1], but there exist
other suggestions that the E'β involves a nearby hydrogen atom at a H-Si≡ center (the
additional proton shifts the hyperfine splitting of the EPR signal) [5]. Depending on the
distance of this H-Si≡ unit from the ≡Si•, this hyperfine shift could be very small and
essentially negligible in experiment; therefore we conclude that the neutral isolated ≡Si•
fragments observed in this study can be thought of as either E'γ or E'β centers. Calculation
of the spectroscopic and hyperfine splitting factors for each of the ≡Si• defects found
could offer distinction, but presently we emphasize that the claim of neutral isolated E'
defects, regardless of the presence of an oxygen vacancy [3], is firmly supported by this
work.

Fig. 5.6. Geometric relaxation occurrence percentages for the II-Si.

Additionally, removing or adding an electron to the ≡Si• yields the respective ≡Si⁺ or
≡Si:⁻ fragments. Likewise, starting with ≡Si:⁻ and removing two electrons leads to ≡Si⁺, and
starting with ≡Si⁺ and adding two electrons leads to ≡Si:⁻. This provides an indication of a
hysteresis among E'γ and E'β centers in which they can switch between being hole,
electron/hole, and electron capture centers. This reversible behavior of E' centers has been
demonstrated experimentally [19]; however, the atomic relaxations seen here contradict
previous proposals that account for this behavior by requiring the presence of an NOV.
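
The reversible switching among the three III-Si charge states described above can be summarized as a small state machine. The Python sketch below is only a bookkeeping illustration: the state descriptions paraphrase the text, and the function and dictionary names are assumptions introduced for this example.

```python
# Charge states of an isolated III-Si, keyed by net charge.
# Descriptions paraphrase the text; names are hypothetical.
III_SI_STATES = {
    +1: "planar, sp2, hole on 3p orbital, diamagnetic",
    0: "pyramidal, sp3, one unpaired electron, paramagnetic",
    -1: "pyramidal, sp3, paired electrons, diamagnetic",
}

CARRIER_CHARGE = {"h+": +1, "e-": -1}


def capture(charge, carrier):
    """Return the III-Si charge state after capturing one hole or electron."""
    new_charge = charge + CARRIER_CHARGE[carrier]
    if new_charge not in III_SI_STATES:
        raise ValueError(f"no III-Si state with charge {new_charge:+d} is described")
    return new_charge


def capture_sequence(charge, carriers):
    """Apply a sequence of carrier captures, one at a time."""
    for carrier in carriers:
        charge = capture(charge, carrier)
    return charge
```

Starting from the (-1) state, `capture_sequence(-1, ["h+", "h+"])` ends in the (+1) state, and `capture_sequence(+1, ["e-", "e-"])` returns to (-1), mirroring the two-carrier pathways described in the text.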

NOV. Considering the pure NOV defects (not the small percentage of EST-NOV defects),
the removal of a single electron universally results in an increase in the Si-Si dimer bond
length of ~0.1-0.2 Å. This implies that a ≡Si⁺ •Si≡ center is created through the capture of
a hole by the dimer bond. The weaker bond is likely due to the delocalization of the
unpaired electron over the remaining pair of sp3 hybrid orbitals. Previous theoretical
studies have reported this structure as having EPR signals in good agreement with experiment
for the E'δ center [4, 18]. Hence, our results further support the creation of positively
charged paramagnetic E'δ centers as a result of NOV hole trapping.

III-O. We find that this defect exists in a positively charged diamagnetic state (⁺≡O).
The neighboring silicon atoms lie essentially in the plane of the overcoordinated oxygen
with Si-O-Si angles of ~120°, suggesting that the O is sp2 hybridized with a doubly occupied
2p orbital and forms the three bonds with its sp2 orbitals. The Si-O bonds are weaker than
those of a two-fold bridging oxygen, with lengths roughly 0.2 Å longer; we therefore
hypothesize that the hole is delocalized over the three nearly D3h symmetric hybrid
orbitals, which is why the positive charge is placed on the three silicons in our adopted
notation of ⁺≡O for this defect. This defect is found to be an electron capture center, upon
which one of the three Si-O bonds breaks, leaving behind an isolated neutral paramagnetic
III-Si (i.e., an E'γ or E'β center). The reverse reaction is also seen to take place: an
E'γ or E'β center that is near a network oxygen will capture a hole and form a ⁺≡O center
with its proximal network O neighbor. This relaxation can be interpreted as nearly analogous
to the proposed NOV capture of a hole with one of the silicon atoms becoming puckered by
back-bonding to another network oxygen, the originally accepted E'γ center formation
mechanism [17]. However, here no oxygen vacancy is necessary and the puckered silicon
precursor is an isolated ≡Si• fragment. We note that the formation of ⁺≡O centers does occur
upon removing two electrons from a NOV defect. As stated above, the removal of one electron
from the NOV results in a positively charged E'δ center. Removal of a second electron leads
to either two ⁺≡O centers or a ⁺≡O/≡Si⁺ pair, both with negative-U behavior. In a fashion
similar to the III-Si, this implies a hysteretic behavior in which ⁺≡O centers are converted
into ≡Si• and vice versa through the respective capture of an electron and a hole.

V-Si. This defect is a negatively charged diamagnetic center (≡Si=⁻). An extra electron
allows for the creation of the fifth Si-O bond. This additional bond is known as a floating
bond and has been hypothesized to exist both theoretically and experimentally in amorphous
silicon (a-Si) [11, 8], and tentatively identified experimentally in ≡Si=⁻ [10]. We find
that these ≡Si=⁻ can act as hole traps, resulting in neutral defect centers with weaker Si-O
bonds, likely indicating that the hole is delocalized over the five bonds. Taking into
account the relatively low formation energy of this defect [3], we conclude that this defect
is a positive charge trap that has not previously been well characterized.

II-Si. We find that this defect exists in three different charge states: neutral, negative,
and doubly negative. The (0) state has an unpaired electron in each of its two non-bonding
sp3 orbitals. Upon electron capture the (-1) state involves an unpaired electron on one of
these orbitals and a pair of electrons on the other orbital. Capture of a second electron
leads to a (-2) state and the disappearance of hybridization, with a filled 3s orbital, a
pair of electrons on one 3p orbital, and the Si-O bonds formed with the remaining two
orthogonal 3p orbitals. This indicates that the II-Si is an electron trapping center;
however, the small number of these defects studied does not lend statistical support to this
hypothesis. We also find this defect is involved with trapping positive charge, as one and
two ⁺≡O centers result from a local network rearrangement involving nearby oxygen atoms upon
the respective capture of one and two holes.

7. Conclusion. We have used density functional techniques which properly account for
the electrostatics in the presence of PBCs to study an array of charge states for numerous
isolated defect sites in a-SiO2. In addition to identifying the charge transition levels of
various charge traps within a-SiO2, specific atomic geometry relaxations corresponding to
each level have been reported. We found that isolated paramagnetic E'γ and E'β defects occur
in the neutral charge state and are capable of trapping both electrons and holes.
Statistical support for the NOV-originated dimerized model of the E'δ defect has also been
demonstrated. As the result of hole trapping by an undercoordinated silicon, we showed that
a significant number of stable ⁺≡O defects result and are likely a prevalent source of
trapped positive charge. Finally, support for a not well established hole trap associated
with an overcoordinated silicon floating bond defect has been presented.

WHEN NANO-PARTICLES COLLIDE VIA MOLECULAR DYNAMICS

YOICHI  TAKATO*,  JEREMY  B.  LECHMAN',  AND  SURAJIT  SEN* 

Abstract. We perform nanoparticle collision simulations using molecular dynamics, and calculate the
coefficient of restitution for two identical nanoparticles made of an fcc lattice of Lennard-Jones atoms whose size
ranges from 603 to 44,403 atoms per particle. Varying the collision velocity, we find distinct regimes of particle
behavior: elastic and plastic deformations. In the plastic deformation region there appear to be three distinct
deformation modes: dislocations of atoms in a particle, large permanent deformations of a particle, and phase
transition effects. We make preliminary comparisons of nanoparticle behavior to that of macroscopic particles.


1. Introduction. The collision of particles is a fundamental process that occurs in many fields of applied science: aerosol aggregation and collection, formation of planetary systems, powder and ceramics processing, initiation of energetic materials, and blast mitigation. It is a simple process that is often used in undergraduate physics classes to introduce the concepts of momentum and energy balance in the context of classical mechanics. It is in this context that many have been introduced to the “coefficient of restitution,” which is simply the ratio of the relative velocity of the two particles after the collision to that before the collision. The coefficient of restitution is a convenient physical quantity in practice because, without knowing any microscopic details of the collision process, it is possible to determine how much energy is dissipated or converted to other forms during the collision. If the relative speed of the particles is the same after the collision as before, the collision is said to be elastic and the coefficient of restitution is unity. If the relative speed decreases, the coefficient of restitution is less than one. For this reason, many experiments involving collisions of macroscopic objects have been performed, and their results have been analyzed in terms of the coefficient of restitution.

Many theoretical studies of quasi-static contact between two macroscopic objects have been carried out. Hertz [1] derived a nonlinear contact force between elastic spheres, now known as Hertzian contact theory. For particles colliding under the influence of this quasi-static force, the coefficient of restitution is unity; no energy is lost from the center-of-mass translational degrees of freedom during the collision. In the case of plastic deformation, the coefficient of restitution is less than unity. Johnson [2] has shown that the relation between the coefficient of restitution and the collision impact velocity can be described by a power law; the coefficient of restitution decreases with increasing collision velocity beyond the velocity at which plastic yield begins.

However, from the microscopic point of view, the collision process is still poorly understood. In particular, the dissipation mechanisms associated with plastic collisions (coefficient of restitution less than one) are poorly understood. The kinetic energy lost (from the macroscopic, center-of-mass point of view) can be distributed into internal energy (e.g., heat and/or a change in the total potential energy) or into kinetic energy other than center-of-mass translational kinetic energy (e.g., rotational or vibrational kinetic energy of the particles). To our knowledge, no study has analyzed the energy transformation and dissipation mechanisms and related them to the coefficient of restitution over a wide range of collision velocities for a simple atomistic system.

In principle, a molecular dynamics (MD) simulation can calculate any instantaneous microscopic physical quantity, such as the position and velocity of each atom and the energy and force associated with each atom's motion and position. The collision process takes place in a very short time, and hence it is quite difficult to track experimentally. In contrast, MD can easily track any moment of a collision and can recover macroscopic phenomena such as deformations and vibrations of particles from microscopic quantities. It can yield the coefficient of restitution as well, and can even connect instantaneous phenomena to the coefficient of restitution. Unfortunately, due to computational limitations, MD can only treat particles far smaller than macroscopic particles within a practical computation time.

While some nanoparticle collisions have been simulated by other researchers using MD, those studies are limited to the elastic collision regime or the very high collision velocity regime and do not analyze the details of the energy balance. Most MD studies have dealt with collisions of a particle with a flat surface, while a few have reported particle-particle collisions. For the latter case, Kalweit et al. [4, 3] have reported the dynamics of collisions of large particles of up to 10,973 atoms at collision velocities high enough that the particles coalesce or break into fragments. At the other extreme, Kuninaka et al. [5, 6] have shown that the coefficient of restitution of a nanoparticle of 682 Lennard-Jones atoms can exceed unity at very low impact velocities.

In this study, we use LAMMPS, an MD code developed by Sandia National Laboratories, to simulate collisions of two identical spherical nanoparticles. The largest particle for which we report results contains 44,403 atoms. We present the coefficient of restitution of the particles over a wide range of impact velocities, so that the particles' behavior ranges from elastic deformation to plastic deformation. We compare the calculated coefficient of restitution with that derived from Johnson's macroscopic plastic deformation theory. We also show the energy balance before and after collision and relate it to the coefficient of restitution.

2. Simulation Method. The nanoparticles considered here are approximate spheres of various radii $R$ constructed from a face-centered cubic (fcc) lattice of Lennard-Jones atoms. We generate seven different sizes of nanoparticles ranging from 603 to 44,403 atoms in a single particle, which correspond to approximately $R = 5.4\sigma$ to $R = 22.7\sigma$ (see Table 2.1). The particles are not perfect spheres; they have facets due to the underlying fcc lattice of atoms. To simulate collisions, two identical particles are placed with facets facing each other (see Fig. 4.6(a)) and are given equal and opposite center-of-mass velocities.

We use classical MD for this study and consider nanoparticles whose constituent atoms interact through the Lennard-Jones (L-J) potential:

\[
V(r_{ij}) =
\begin{cases}
4\epsilon\left[\left(\dfrac{\sigma}{r_{ij}}\right)^{12} - \left(\dfrac{\sigma}{r_{ij}}\right)^{6}\right] & (r_{ij} < r_c), \\[1ex]
0 & (r_{ij} \ge r_c),
\end{cases}
\tag{2.1}
\]

where $r_{ij}$ is the inter-atomic distance between atoms $i$ and $j$, and $r_c$ is the cutoff distance. We set $r_c = 2.5\sigma$ for atoms interacting within a single particle, a value typically used for the L-J potential. In addition, we ignore particle-particle adhesive interactions, and therefore set $r_c = 2^{1/6}\sigma$ (purely repulsive) for interactions between atoms belonging to different particles.

Note that the units used here are L-J units: the unit of energy is $\epsilon$ and the unit of distance is $\sigma$. In addition, the unit of velocity is $\sqrt{\epsilon/m}$, the unit of temperature is $\epsilon/k_B$, where $k_B$ is the Boltzmann constant, and the unit of time is $1/\sqrt{\epsilon/(m\sigma^2)}$.

The full simulation procedure follows a number of steps. First, two identical particles are placed and thermally equilibrated over 10,000 MD steps at a temperature of $T = 0.02\,\epsilon/k_B$, controlled with the Nosé-Hoover thermostat. After reaching an equilibrium state, the particles are given equal and opposite center-of-mass velocities and are thus set to collide with a relative collision velocity of $v_{\mathrm{coll}}$. The collision process is carried out in the microcanonical ensemble (NVE). For most system sizes, $v_{\mathrm{coll}}$ is varied from $0.03\sqrt{\epsilon/m}$ to $3.3\sqrt{\epsilon/m}$, where $m$ is the mass of an atom. The time step is set to $\Delta t = 0.005/\sqrt{\epsilon/(m\sigma^2)}$ for all cases. The relative deviation of the total energy of the system during the calculations stays around $10^{-5}$. This is achieved for finite cutoffs by shifting the potential to zero at $r_c$. In order to quantify errors in calculated quantities, we perform multiple simulations for each collision velocity with different initial thermal velocities. The systems and the number of independent runs carried out for ensemble averaging are given in Table 2.1.
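
As an illustration of the pair interaction described in this section, the following minimal Python sketch evaluates a Lennard-Jones energy that is truncated at the cutoff and shifted so that it vanishes there. It is not the LAMMPS implementation; the function name `lj_truncated_shifted` and the constants `RC_INTRA` and `RC_INTER` are hypothetical names chosen here, and L-J units ($\epsilon = \sigma = 1$) are assumed.

```python
def lj_truncated_shifted(r, rc, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy, truncated at rc and shifted so V(rc) = 0."""
    if r >= rc:
        return 0.0

    def lj(x):
        # Plain 12-6 Lennard-Jones energy at separation x.
        sr6 = (sigma / x) ** 6
        return 4.0 * epsilon * (sr6 * sr6 - sr6)

    # Subtracting lj(rc) removes the energy jump at the cutoff.
    return lj(r) - lj(rc)


# Cutoffs quoted in the text (in units of sigma).
RC_INTRA = 2.5                 # atoms within the same particle
RC_INTER = 2.0 ** (1.0 / 6.0)  # atoms in different particles (purely repulsive)

for r in (0.95, 1.0, 1.1, 1.5, 2.4):
    print(r,
          lj_truncated_shifted(r, RC_INTRA),
          lj_truncated_shifted(r, RC_INTER))
```

With the inter-particle cutoff placed at the potential minimum, the shifted energy is non-negative everywhere, which is what makes the particle-particle interaction purely repulsive.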

Table 2.1
The systems for simulations

Radius $R/\sigma$    Number of atoms $N$    Number of runs
5.4                  603                    10
6.5                  1,055                  10
8.1                  2,093                  10
11.4                 5,481                  10
14.6                 11,849                 10
18.9                 25,315                 1
22.7                 44,403                 10

3.  Background. 

3.1. Coefficient of restitution. The coefficient of restitution $e$ is defined by momentum balance using the relative velocity $v_r'$ after collision and the relative velocity $v_r$ before collision:

\[
e = \frac{|v_r'|}{|v_r|}. \tag{3.1}
\]

This is the standard, widely used definition. It is convenient for obtaining the coefficient of restitution experimentally because the velocities of objects before and after a collision are relatively easy to measure. It does not, however, say anything about the mechanisms of energy dissipation in collisions, especially regarding the internal degrees of freedom. To investigate these details, it is more advantageous to consider the energy balance and substitute it into the coefficient of restitution as defined by momentum balance. Our study uses classical MD, which computes microscopic quantities (internal degrees of freedom), namely the positions and velocities of each individual atom, and allows us to recover macroscopic quantities such as the center-of-mass translational kinetic energy of the particle and the temperature of the system.

Consider the energy balance of a collision process. The total energy $E$ of the system is composed of the energy associated with the average motion of the particles, namely the center-of-mass translational kinetic energy $K_{\mathrm{cm}}$ and the rotational kinetic energy about the center of mass of each particle $K_{\mathrm{rot}}$, as well as other forms of macroscopic kinetic energy $K_{\mathrm{else}}$ (e.g., vibrational modes of an elastic sphere), and the energy of internal modes $E_{\mathrm{int}}$ (e.g., fluctuations about the average motion, and internal energy of particles):

\[
E = K_{\mathrm{cm}} + K_{\mathrm{rot}} + K_{\mathrm{else}} + E_{\mathrm{int}}. \tag{3.2}
\]

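Since we perform simulations in the microcanonical ensemble, the system's total energy is conserved. Therefore, once enough time has elapsed after the collision, the energy difference before and after collision is zero:

\[
E' - E = \Delta K_{\mathrm{cm}} + \Delta K_{\mathrm{rot}} + \Delta K_{\mathrm{else}} + \Delta E_{\mathrm{int}} = 0. \tag{3.3}
\]

The coefficient of restitution in the center-of-mass reference frame can be written as

\[
e = \sqrt{\frac{K'_{\mathrm{cm}}}{K_{\mathrm{cm}}}}. \tag{3.4}
\]

Using the change in center-of-mass kinetic energy $\Delta K_{\mathrm{cm}} = K'_{\mathrm{cm}} - K_{\mathrm{cm}}$, it follows that

\[
e = \sqrt{\frac{K_{\mathrm{cm}} - \Delta K_{\mathrm{rot}} - \Delta K_{\mathrm{else}} - \Delta E_{\mathrm{int}}}{K_{\mathrm{cm}}}}
  = \sqrt{1 - \frac{\Delta\left(K_{\mathrm{rot}} + K_{\mathrm{else}} + E_{\mathrm{int}}\right)}{K_{\mathrm{cm}}}}. \tag{3.5}
\]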


The term $\Delta(K_{\mathrm{rot}} + K_{\mathrm{else}} + E_{\mathrm{int}})$ represents the loss of translational kinetic energy during collision. In general, for head-on collisions $\Delta K_{\mathrm{rot}} = 0$. In this study, however, we account for the rotational kinetic energy about the center of mass of each particle. This is necessary even though we consider nominally head-on collisions; results shown in Fig. 4.3(b) indicate that some particles rotate about their own center of mass after collision. This is possible because the particles have thermal (fluctuating) velocities before collision; hence they can acquire rotational energy through the symmetry breaking caused by fluctuations of individual atoms during the collision. We therefore include rotational kinetic energy in the energy balance.
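
The equivalence of the velocity-based and energy-based definitions can be checked with a short Python sketch. The function names and the numerical values below are hypothetical and chosen only for illustration (two equal particles of mass 1000 in L-J units, approaching at a relative velocity of 2.0 and separating at 1.0); they are not simulation results.

```python
import math


def restitution_from_velocity(v_rel_before, v_rel_after):
    """Velocity definition: ratio of relative speeds after and before collision."""
    return abs(v_rel_after) / abs(v_rel_before)


def restitution_from_energy(k_cm, d_k_rot, d_k_else, d_e_int):
    """Energy definition (Eq. 3.5): e = sqrt(1 - Delta(K_rot + K_else + E_int) / K_cm)."""
    gained = d_k_rot + d_k_else + d_e_int
    if gained > k_cm:
        raise ValueError("energy gained by other modes exceeds the initial K_cm")
    return math.sqrt(1.0 - gained / k_cm)


# Hypothetical example: each particle has mass 1000 and moves at +/-1.0 before
# and -/+0.5 after the collision, in the center-of-mass frame.
mass = 1000.0
k_cm_before = 2 * 0.5 * mass * 1.0 ** 2   # 1000
k_cm_after = 2 * 0.5 * mass * 0.5 ** 2    # 250
lost = k_cm_before - k_cm_after           # 750, split arbitrarily below

e_velocity = restitution_from_velocity(2.0, -1.0)
e_energy = restitution_from_energy(k_cm_before, d_k_rot=50.0, d_k_else=0.0,
                                   d_e_int=lost - 50.0)
print(e_velocity, e_energy)
```

Because the total energy is conserved, how the lost translational energy is divided among rotation, other macroscopic modes, and internal energy does not change $e$; only the sum enters. Omitting one of the terms (for example the rotational part) overestimates $e$.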

3.2. Yielding velocity. Johnson [2] has derived a relationship for the velocity $v_Y$ at the onset of material yield and plastic deformation,

\[
\frac{1}{2} M v_Y^2 = \frac{53\,R'^3 Y^5}{E^{*4}}, \tag{3.6}
\]

where $Y$ is the yield stress of a particle, $M = Nm/2$, $R' = R/2$, $E^* = E/[2(1 - \nu^2)]$; $Nm$, $R$, and $E$ are the mass, radius, and Young's modulus of a particle; and $\nu$ is Poisson's ratio. The theoretical yield stress for an ideal fcc lattice is given by

\[
Y = \frac{G}{10}, \tag{3.7}
\]

where $G$ is the shear modulus. Quesnel et al. determined Young's modulus, Poisson's ratio, and the shear modulus in the [100] direction for the fcc Lennard-Jones solid using molecular dynamics [7]. Their values $E = 61.1\,\epsilon/\sigma^3$, $G = 51.2\,\epsilon/\sigma^3$, $\nu = 0.347$ and Eq. 3.6 give the yielding velocity for the fcc lattice,

\[
v_Y = 0.945\,\sqrt{\frac{(R/\sigma)^3}{N}}\,\sqrt{\frac{\epsilon}{m}}. \tag{3.8}
\]

Note that the yielding velocity does not depend on the particle size: the number of atoms $N$ grows as the cube of the particle radius $R$, so the ratio $R^3/N$ in Eq. 3.8 stays constant.
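
The claim that $R^3/N$ is nearly independent of particle size can be checked directly against the radii and atom counts listed in Table 2.1. The short Python sketch below does only that; the variable names are hypothetical, and the radii are the rounded values from the table, so small variations are expected.

```python
# (R / sigma, N) pairs as listed in Table 2.1.
TABLE_2_1 = [
    (5.4, 603),
    (6.5, 1055),
    (8.1, 2093),
    (11.4, 5481),
    (14.6, 11849),
    (18.9, 25315),
    (22.7, 44403),
]

# Ratio (R/sigma)^3 / N that enters the yielding velocity expression.
ratios = [radius ** 3 / n_atoms for radius, n_atoms in TABLE_2_1]
mean_ratio = sum(ratios) / len(ratios)
relative_spread = (max(ratios) - min(ratios)) / mean_ratio

for (radius, n_atoms), ratio in zip(TABLE_2_1, ratios):
    print(f"R = {radius:5.1f}  N = {n_atoms:6d}  R^3/N = {ratio:.4f}")
print(f"mean = {mean_ratio:.4f}, relative spread = {relative_spread:.3f}")
```

Since the yielding velocity scales as the square root of this ratio, a spread of a few percent in $R^3/N$ translates into an even smaller spread in $v_Y$.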

4.  Results. 

4.1. Particle temperature pre-collision and post-collision. The temperature $T$ of the system is related to the kinetic energy associated with velocity fluctuations about the average velocity. We calculate the temperature by

\[
\sum_{i=1}^{N} \frac{1}{2} m \left(\mathbf{v}_i - \mathbf{v}_{\mathrm{cm}}\right)^2 = \frac{3}{2} N k_B T, \tag{4.1}
\]

where $N$ is the number of atoms in the system, $\mathbf{v}_i$ is the velocity of atom $i$, and $\mathbf{v}_{\mathrm{cm}}$ is the center-of-mass velocity of the particle. To obtain the temperature of the system, the center-of-mass velocity of the particle to which an atom belongs is subtracted from the individual atom's velocity. In Fig. 4.1, the bar graph shows the distribution of atom speeds obtained in our simulation for $N = 5481$, $T = 0.02\,\epsilon/k_B$ after equilibration. The continuous line depicts the theoretical (Maxwellian) distribution for the given temperature.
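
The temperature calculation described here can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code used for the paper: the function names are hypothetical, all atoms are assumed to have equal mass, L-J units ($m = k_B = 1$) are used, and the velocities are synthetic Gaussian samples rather than MD output.

```python
import random


def center_of_mass_velocity(velocities):
    """Center-of-mass velocity of one particle (equal atomic masses assumed)."""
    n = len(velocities)
    return tuple(sum(v[k] for v in velocities) / n for k in range(3))


def temperature(particles, mass=1.0, k_b=1.0):
    """Temperature from velocity fluctuations about each particle's own
    center-of-mass velocity, using sum(m/2 * dv^2) = (3/2) N k_B T."""
    n_total = 0
    kinetic = 0.0
    for velocities in particles:
        v_cm = center_of_mass_velocity(velocities)
        for v in velocities:
            kinetic += 0.5 * mass * sum((v[k] - v_cm[k]) ** 2 for k in range(3))
        n_total += len(velocities)
    return 2.0 * kinetic / (3.0 * n_total * k_b)


def sample_particle(n_atoms, temp, drift, rng, mass=1.0, k_b=1.0):
    """Synthetic velocities: Gaussian thermal part plus a uniform drift."""
    width = (k_b * temp / mass) ** 0.5
    return [tuple(rng.gauss(drift[k], width) for k in range(3))
            for _ in range(n_atoms)]


rng = random.Random(0)
# Two particles drifting toward each other along x (hypothetical values).
left = sample_particle(3000, 0.02, (+1.65, 0.0, 0.0), rng)
right = sample_particle(3000, 0.02, (-1.65, 0.0, 0.0), rng)
print(temperature([left, right]))
```

Subtracting the center-of-mass velocity of each particle separately matters: subtracting only the system-wide mean velocity (zero here) would count the drift of the two particles as thermal motion and greatly overestimate the temperature.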


Fig. 4.1. The distribution of atom speeds follows a Maxwellian ($N = 5481$, $T = 0.02\,\epsilon/k_B$). The solid line is the theoretical distribution.


The velocity distributions of the atoms after collision are shown in Fig. 4.2. The x-axis is taken along the collision axis of the two particles, and the other axes, y and z, are perpendicular to it. The velocity distributions are very nearly Gaussian, and little difference is seen among the three component distributions.

4.2.  Energy  conservation  of  the  system.  The  time  evolution  of  total  energy,  kinetic 
energy,  and  potential  energy  of  the  system  for  N  =  5481,  vcon  =  0.66,  and  0.6  is  shown 
in Fig. 4.3. The equilibration over the first 10,000 MD time steps is done in the canonical ensemble (NVT). After the equilibration, the particles are given a translational velocity so that they collide with each other. During the collision stage, the simulation is run in the NVE ensemble. The fluctuations of the total energy up to 10,000 MD steps come from controlling the system's temperature. At 10,000 MD time steps, the sudden steps in both total energy and kinetic energy are due to the kinetic energy associated with the particles' center-of-mass motions. The increase in potential energy and the drop in kinetic energy at 12,500 MD time steps indicate the beginning of the collision process. Long enough after the collision (beyond 20,000 MD steps), both kinetic and potential energies stay constant in time. For this system, the particles start to rotate at the beginning of the collision, which appears as a sharp peak in Fig. 4.3(b), and the rotational energies of both particles are conserved from the end of the contact at about 16,000 MD steps. The loss of translational kinetic energy of this system during the collision is roughly $4{,}000\,\epsilon$, and the rotational energy of the system post-collision is about $4\,\epsilon$, which is much less than the translational kinetic energy loss. Therefore, the rotation of particles is not significant for this system. The temperature of each particle is plotted in Fig. 4.3(c). The temperature is controlled to be $0.02\,\epsilon/k_B$ until step 10,000. As the collision begins, the temperature starts to increase, and at the end of the collision it reaches another equilibrium state, gaining energy from part of the translational kinetic energy loss.

Fig. 4.2. Velocity distributions of the atoms after collision for $N = 5481$, $v_{\mathrm{coll}} = 3.3\sqrt{\epsilon/m}$, showing Gaussian distributions: (a) x-component, (b) y-component, (c) z-component. Horizontal axes: velocity of atoms / $(\epsilon/m)^{1/2}$.

4.3. Coefficient of restitution. Now we consider the coefficient of restitution for $N = 5481$. In Fig. 4.4, three methods of determining the coefficient of restitution are shown. The circles show the coefficient of restitution computed from center-of-mass velocities as defined in Eq. 3.1. The triangles show the coefficient of restitution computed from the translational kinetic energy $K_{\mathrm{cm}}$ and the “loss of energy” into rotational kinetic energy and internal degrees of freedom, $\Delta(K_{\mathrm{rot}} + E_{\mathrm{int}})$. The stars show the coefficient of restitution computed from the translational kinetic energy $K_{\mathrm{cm}}$ and the “loss of energy” into internal degrees of freedom $\Delta E_{\mathrm{int}}$ only; that is, only the gain in thermal kinetic energy is counted as the “loss of energy” behind the macroscopically observed reduction in translational center-of-mass velocity of the particles. Both are defined by Eq. 3.5. It can be seen that the coefficient of restitution calculated from the center-of-mass velocity alone compares well with the one that counts both the rotational kinetic energy gain and the internal kinetic energy gain as energy losses. The coefficient of restitution that counts only thermal energy as the energy loss is a bit higher for $0.25 < v_{\mathrm{coll}} < 0.5$ (in units of $\sqrt{\epsilon/m}$). Interestingly, particles that rotate about their center of mass after collision are observed in this region; therefore this form of “energy loss” must be considered.

The reason some particles rotate after collision is that random motions of atoms in one particle exert forces on the other particle during the collision. If the line of action of the total force between the particles does not pass through the center of mass, the force generates a torque that rotates the particle. If many simulations are run with different initial conditions, the directions of rotation should be random, since they derive from random motions. To verify this, we generated the initial thermal velocities randomly and performed ten simulations with different initial thermal velocities. The results exhibit various rotational directions; hence the rotations of particles after collisions appear to be consistent with random motions of atoms.

Fig. 4.3. A sample of energy transformation for $N = 5481$ and $v_{\mathrm{coll}} = 2.0\sqrt{\epsilon/m}$, plotted against time step ($\times 10^4$): (a) total energy, kinetic energy and potential energy of the system (unit: $\epsilon$); (b) rotational kinetic energy around the center of mass of each particle (unit: $\epsilon$); (c) temperature of each particle (unit: $\epsilon/k_B$).

The coefficient of restitution obtained in our numerical simulations for all system sizes is shown in Fig. 4.5. Fig. 4.5(c) shows only the $N = 5481$ data taken from Fig. 4.5(b).

In our simulations, the collision velocity is varied from $v_{\mathrm{coll}} = 0.03\sqrt{\epsilon/m}$ to $3.3\sqrt{\epsilon/m}$. In this range of velocity, several regimes of the coefficient of restitution, distinguished by the particle's deformation modes, are observed: an elastic deformation regime, a plastic deformation regime, and, in addition, what appear to be phase transition effects at the higher end of the plastic regime. The elastic regime is $v_{\mathrm{coll}} < 0.5\sqrt{\epsilon/m}$, and the plastic regime is $v_{\mathrm{coll}} > 0.5\sqrt{\epsilon/m}$. The phase transition region, in which the particle surface appears to melt, is at $v_{\mathrm{coll}} > 2.2\sqrt{\epsilon/m}$. There is one more significant transition around $v_{\mathrm{coll}} = 1.3\sqrt{\epsilon/m}$ that separates distinct deformation modes.

The coefficient of restitution in the elastic regime is closer to unity than in the plastic deformation regime, as expected. For most system sizes, the coefficient of restitution peaks at around $v_{\mathrm{coll}} = 0.05\sqrt{\epsilon/m}$. The system with $N = 11{,}849$ behaves differently: the coefficient of restitution keeps increasing below $v_{\mathrm{coll}} = 0.05\sqrt{\epsilon/m}$ and exceeds unity. Although these results are intriguing, the error bars on the data for $v_{\mathrm{coll}} < 0.015\sqrt{\epsilon/m}$ are quite large; further discussion should therefore wait until more data have been collected to reduce the error bars. For $0.2\sqrt{\epsilon/m} < v_{\mathrm{coll}} < 0.33\sqrt{\epsilon/m}$, a particle-size dependence appears when the coefficients of restitution are compared at the same velocity, $v_{\mathrm{coll}} = 0.2$, $0.26$, and $0.33\sqrt{\epsilon/m}$: as the particle size gets larger, the coefficient of restitution becomes lower.

Fig. 4.4. Comparison of the coefficient of restitution defined by energy and by velocity (system size $N = 5481$). Circles: velocity definition; triangles: energy definition; stars: energy definition without rotational kinetic energy. Horizontal axis: collision velocity (logarithmic scale).

Next, we look at the plastic regime, $0.5\sqrt{\epsilon/m} < v_{\mathrm{coll}} < 3.3\sqrt{\epsilon/m}$. In this regime, different deformation modes are observed: dislocation and large deformation as well as surface melting. There appear to be several kinks at $v_{\mathrm{coll}} = 0.4$, $1.0$, $2.0$, and $2.6\sqrt{\epsilon/m}$. For $0.5\sqrt{\epsilon/m} < v_{\mathrm{coll}} < 1.0\sqrt{\epsilon/m}$, particles with $N = 5481$ have a single dislocation along a crystal face (see Fig. 4.6(b)), while particles with $N = 44{,}403$ have multiple dislocations due to the collision. In the range $1.2\sqrt{\epsilon/m} < v_{\mathrm{coll}} < 2.0\sqrt{\epsilon/m}$, the collision causes large deformation of the particles, and they have more than simple dislocation bands; their shapes are no longer spherical (see Fig. 4.6(c)). For collision velocities $v_{\mathrm{coll}} > 2.0\sqrt{\epsilon/m}$, particles have very little of their original crystal structure left after collision. Atoms on the surface move freely, and the surface of the particles melts (see Fig. 4.6(d)). Thus, the plateau at about $v_{\mathrm{coll}} = 2.0\sqrt{\epsilon/m}$ seems to coincide with a phase transition of L-J atoms.

In the plastic regime, the coefficient of restitution appears to be inversely proportional to a power of the collision velocity; that is, it follows a power law with a negative exponent and can be expressed in the form $e = c\,v_{\mathrm{coll}}^{n}$. Table 4.1 lists the exponent and coefficient for each region between adjacent kinks. A theoretical relation between the coefficient of restitution and the collision velocity of macroscopic particles that deform plastically has been derived by Johnson [2], and gives

\[
e = c'\,v_{\mathrm{coll}}^{-1/4}, \tag{4.2}
\]

where $c'$ is determined by the geometry and material properties of the particles. The coefficient of restitution thus depends on the $-1/4$ power of the collision velocity, which differs from our results. The disagreement may stem from the system sizes, the L-J potential, or the NVE ensemble. The system sizes in our simulations are much smaller than macroscopic ones. If size is the major cause of the disagreement, increasing the size of the simulations would resolve it. We are carrying out much larger calculations to determine this. The L-J potential is one of the simplest potential functions, and it may yield unrealistic simulation results when compared to other materials. Employing a more realistic potential function may bring the coefficient of restitution closer to the theoretical values. Finally, the NVE ensemble keeps the system energy constant; as a result, a considerable part of the lost center-of-mass translational kinetic energy is converted into thermal energy, raising the system's temperature as shown in Fig. 4.3(c). However, the energy of a real system may not stay constant during a collision. In reality, the initial kinetic energy could be removed and transferred to the environment during the collision, and the system's temperature may not increase as much as the figure shows. A more realistic model that includes energy dissipation into the environment could lead to a higher coefficient of restitution. Simulations using the NVT ensemble, which keeps the system's temperature constant, would incorporate a similar energy dissipation effect and may improve the coefficient of restitution in the plastic deformation regime. We are also performing simulations with a weak Langevin thermostat applied to NVE to determine the nature of this effect.
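
One common way to extract the exponent $n$ and coefficient $c$ of a power law $e = c\,v_{\mathrm{coll}}^{n}$ is an ordinary least-squares fit of $\ln e$ against $\ln v_{\mathrm{coll}}$. The Python sketch below illustrates this; it is an assumption about how such a fit could be done, not a description of the fitting procedure actually used for Table 4.1. The function name `fit_power_law` is hypothetical, and the data are synthetic points generated from the first row of Table 4.1 ($n = -0.91$, $c = 0.34$), not simulation output.

```python
import math


def fit_power_law(velocities, restitutions):
    """Least-squares fit of ln(e) = ln(c) + n * ln(v); returns (c, n)."""
    xs = [math.log(v) for v in velocities]
    ys = [math.log(e) for e in restitutions]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = s_xy / s_xx
    coefficient = math.exp(mean_y - exponent * mean_x)
    return coefficient, exponent


# Synthetic data on 0.5 <= v_coll <= 1.0, generated from e = 0.34 * v^(-0.91).
velocities = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
restitutions = [0.34 * v ** -0.91 for v in velocities]

c_fit, n_fit = fit_power_law(velocities, restitutions)
print(c_fit, n_fit)
```

Fitting each region between kinks separately, as done for Table 4.1, amounts to calling such a routine on the subset of data points within each velocity range. With noisy data, the fitted exponent depends on where the region boundaries are drawn.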


Table 4.1
Theoretical and computational coefficient and exponent for the power law, $e = c\,v_{\mathrm{coll}}^{n}$

$v_{\mathrm{coll}}/\sqrt{\epsilon/m}$    Simulation exponent $n$    Simulation coefficient $c$    Theory exponent $n$
0.5-1.0                                  -0.91                      0.34                          -1/4
1.3-2.0                                  -1.79                      0.47                          -1/4
2.6-3.3                                  -2.09                      1.16                          -1/4

The yielding velocity obtained in our system is about $0.5\sqrt{\epsilon/m}$ for $N = 5481$ in Fig. 4.5(b). Although we need more data points around the yielding velocity to determine it precisely, it appears that all system sizes show the same yielding velocity. The theoretical yielding velocity derived by Johnson is obtained using Eq. 3.8, and we get $v_Y = 0.166\sqrt{\epsilon/m}$ for our fcc lattice particles made of L-J atoms. This value is smaller than our computational value, but of the same order of magnitude. This is decent agreement, considering the rough estimate of the yield stress $Y$ in Eq. 3.7.

Table 4.2
Yielding velocity $v_Y$ obtained from the theory and our simulations.

                              Simulation    Theory
$v_Y/\sqrt{\epsilon/m}$       0.5           0.166

5. Conclusions. We have simulated nanoparticle collisions with the molecular dynamics code LAMMPS and obtained coefficients of restitution for two identical particles, with up to $N = 44{,}403$ atoms in a single particle, at various collision velocities. We see distinct regimes in the coefficient of restitution ranging from elastic to plastic deformation. We have shown that there is a peak in the coefficient of restitution in the elastic regime, and that there are three distinct regions in plastic deformation: dislocations of atoms in a particle, large deformations of a particle, and phase transition effects. There is a large discrepancy between the behavior of the coefficient of restitution of nanoparticles and that of macroscopic particles; hence further simulations of larger systems, as well as simulations of amorphous particles, will be performed to examine the size effect.

Fig. 4.5. (a) Coefficient of restitution in linear-linear scale (note: the horizontal axis scale is different from 4.5(b)). (b) Coefficient of restitution in log-log scale. Horizontal axes: collision velocity.

Fig. 4.6. Snapshots of particles ($N = 5481$) before and after collision at $v_{\mathrm{coll}} = 0.4$, $1.3$, and $3.3\sqrt{\epsilon/m}$: (a) before collision; (b) after collision at $v_{\mathrm{coll}} = 0.4\sqrt{\epsilon/m}$, where dislocations along a single lattice face are seen in each particle; (c) after collision at $v_{\mathrm{coll}} = 1.3\sqrt{\epsilon/m}$, where the particles are no longer spheres; (d) after collision at $v_{\mathrm{coll}} = 3.3\sqrt{\epsilon/m}$, where the atoms on the surfaces move freely.

NUMERICAL SIMULATION OF THE PERFORMANCE OF A RESONANT TUNNELING DIODE

ANNE  S.  COSTOLANSKI*  AND  ANDREW  G.  SALINGER1^ 

Abstract. A method for modeling the performance of a resonant tunneling diode (RTD) using Sandia's Trilinos software is described. The equations to model the behavior of an RTD are given, along with the corresponding numerical methods. An object-oriented structure using C++ classes is defined, and the method for incorporating Trilinos's nonlinear solver and continuation packages is detailed.


1. Introduction. Resonant tunneling diodes (RTDs) are nanoscale semiconductor devices that were first proposed by Tsu and Esaki in 1973 [15]. They predicted a phenomenon known as negative differential resistance, in which the current through a tunneling barrier reaches a local maximum when the electrons injected in the material achieve certain resonant energies. In 1974, Chang et al. fabricated the first RTD device that demonstrated evidence of negative differential resistance [3]. A decade later, Sollner et al. improved upon previously achieved results [14], and research on RTDs increased.

Resonant tunneling diodes have two distinct features that make them stand out from
other semiconductor devices: their high-speed operation and their ability to produce negative
differential resistance. Since tunneling is a very fast phenomenon (response time is on the
order of picoseconds), the RTD is among the fastest devices ever made, with a maximum
operational oscillation frequency projected to be over 1 THz [11]. Due to these features and
their potential to be used as components in high-speed electronic devices, RTDs have been
studied extensively since the mid-1980s [5, 6, 8, 13, 17, 18].

In  this  paper,  we  propose  an  updated  method  for  simulating  the  performance  of  an  RTD 
using  C++  and  Trilinos.  The  paper  is  organized  as  follows:  Section  2  describes  the  basic 
structure  of  an  RTD,  section  3  details  the  equations  that  model  an  RTD,  and  section  4  specifies 
how  these  equations  are  discretized.  Section  5  discusses  how  to  implement  the  discretized 
equations,  with  section  6  listing  the  details  of  the  C++  classes  and  section  7  the  specific 
Trilinos  implementation.  Our  conclusions  are  listed  in  section  8,  with  future  upgrades  to  the 
code  discussed  in  section  9. 

2.  Resonant  Tunneling  Diodes.  A  typical  resonant  tunneling  diode  is  composed  of  the 
following  sections  (see  figure  2.1): 

• Two regions of narrow energy band gap material that are heavily infused with electrons,
one on either end of the device

•  Several  layers  (barriers)  of  a  larger  energy  band  gap  material  on  the  interior  of  the 
device,  separated  from  the  doped  regions  by  thin  undoped  spacer  regions 

•  Quantum  well  region(s)  separating  the  barrier  regions  from  each  other 

The doped regions on either end of the device are generally large in comparison to
the barrier and well regions. In a typical RTD, the quantum well thickness is around 50 Å, and
the barrier layers can range from 15 to 50 Å. The region in the interior of the device containing
the well(s), barrier layers, and spacer regions has a significantly lower doping level than the
heavily doped regions on each end of the device. The spacer regions separating the barriers
from the doped regions ensure that dopants do not diffuse to the barrier layers. Bias is then
applied across the device to induce current flow.
Fig. 2.1. Energy band diagram of a typical two barrier resonant tunneling diode, where the left endpoint is at
the beginning of the device (i.e., corresponding to a position at x = 0), and the right endpoint is at the end of the
device (i.e., corresponding to a position of x = L, where L is the length of the device). From [11].


3.  The  Wigner-Poisson  Equations.  Since  RTDs  are  built  on  the  nanoscale,  their  physics 
is  governed  by  quantum  mechanics  rather  than  classical  physics.  Thus,  the  standard  drift- 
diffusion  model  no  longer  applies  in  calculating  the  current  in  the  device.  One  of  the  primary 
models  used  for  simulating  the  performance  of  a  resonant  tunneling  diode  is  the  Wigner- 
Poisson  formulation,  which  couples  the  Wigner  equation  with  Poisson’s  equation  to  predict 
the  behavior  of  nanoscale  semiconductor  devices. 

The Wigner function f(x, k, t), which describes the distribution of electrons in the device,
can be derived from the density matrix $\rho(z, z', t)$ for a multi-particle quantum mechanical
system:

$$\rho(z, z', t) = \sum_j f(j)\, \psi_j(z, t)\, \psi_j^*(z', t) \tag{3.1}$$

where $\psi_j$ is an ensemble of wavefunctions that satisfy Schrödinger's equation, z, z' are two
positions in coordinate space, t is time, and f(j) is the Fermi-Dirac distribution function,
which gives the probability that an electron will be in a given quantum mechanical state. The
Wigner function is related to the density matrix via a Fourier transform using a change of
variables with $x = \frac{1}{2}(z + z')$ and $y = z - z'$, so that

$$f(x, k, t) = \int_{-\infty}^{\infty} e^{-iky}\, \rho\!\left(x + \tfrac{1}{2}y,\; x - \tfrac{1}{2}y,\; t\right) dy \tag{3.2}$$

where x, y are position variables and k is momentum. Taking the time derivative of the Wigner
function and simplifying using trigonometric identities and the change of coordinates, we
have

$$\frac{\partial f(x,k,t)}{\partial t} = -\frac{hk}{2\pi m^*}\frac{\partial f(x,k,t)}{\partial x} - \frac{4}{h}\int_{-\infty}^{\infty} f(x,k',t)\,dk' \int_0^{\infty} \sin(2y(k-k'))\,[U(x+y) - U(x-y)]\,dy = K(f) + P(f). \tag{3.3}$$

Here K(f) is defined as the kinetic energy term

$$K(f) = -\frac{hk}{2\pi m^*}\frac{\partial f}{\partial x} \tag{3.4}$$
which corresponds to the effects due to kinetic energy on the distribution function f, where
h is Planck's constant and m* is the effective mass of an electron. The other term in the
derivation, P(f), is the potential term

$$P(f) = -\frac{4}{h}\int_{-\infty}^{\infty} f(x,k',t)\,T(x,k-k')\,dk' \tag{3.5}$$

with

$$T(x,k-k') = \int_0^{\infty} [U(x+y) - U(x-y)]\,\sin(2y(k-k'))\,dy \tag{3.6}$$

where U(x) is the potential energy inside the device. To determine the potential energy, we
need to incorporate both the energy band function, $\Delta_c(x)$, defined by the barriers and wells
within the device, and the electrostatic potential, $u_p(x)$, which accounts for the voltage applied
to the device. Thus

$$U(x) = \Delta_c(x) + u_p(x) \tag{3.7}$$

where $u_p(x)$ is the solution to Poisson's equation

$$\frac{d^2 u_p(x)}{dx^2} = \frac{q^2}{\epsilon}\,[N_d(x) - n(x)] \tag{3.8}$$

with q the charge on an electron, $\epsilon$ the dielectric constant (which is material dependent), $N_d(x)$
the doping profile of the device, and n(x) the electron density function, defined as

$$n(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} f(x,k)\,dk. \tag{3.9}$$

Poisson's equation has boundary conditions that depend on the voltage applied
across the device

$$u_p(0) = V_0, \qquad u_p(L) = V_L \tag{3.10}$$

where $V_0$ is the voltage at the left side of the device, and $V_L$ the voltage at the right
side. Traditionally $V_0 = 0$ and $V_L = -V$ with $V > 0$.

Complete details of the derivation are shown by Buot and Jensen [2]. This derivation,
however, does not take the effects of electron interactions into account. Several different
methods have been tried to incorporate scattering effects into the equation, but detailed
treatments create a heavy computational burden [1]. Thus we will use a low-order relaxation-time
approximation in this model [16] to account for collision interactions between electrons in
the device:

$$S(f) = \frac{1}{\tau}\left[\frac{f_0(x,k)}{\int_{-\infty}^{\infty} f_0(x,k')\,dk'}\int_{-\infty}^{\infty} f(x,k')\,dk' - f(x,k)\right] \tag{3.11}$$

Here S(f) is termed the scattering function, with $\tau$ the relaxation time of an electron in
the device. $f_0(x, k)$ is the equilibrium Wigner distribution function, which is the solution to
equation (3.3) with no change in the bias voltage applied across the device; i.e., $V_L = V_0$.
Thus the full Wigner equation is


$$\frac{\partial f(x,k,t)}{\partial t} = K(f) + P(f) + S(f) = W(f), \tag{3.12}$$

and to solve the Wigner equation, boundary conditions are imposed on the distribution function f:

$$f(0,k) = \frac{4\pi m^* k_B T}{h^2}\ln\left\{1 + \exp\left[-\frac{1}{k_B T}\left(\frac{h^2k^2}{8\pi^2 m^*} - \mu_0\right)\right]\right\}, \quad k > 0 \tag{3.13}$$

$$f(L,k) = \frac{4\pi m^* k_B T}{h^2}\ln\left\{1 + \exp\left[-\frac{1}{k_B T}\left(\frac{h^2k^2}{8\pi^2 m^*} - \mu_L\right)\right]\right\}, \quad k < 0 \tag{3.14}$$

where $k_B$ is Boltzmann's constant, T is the temperature, and $\mu_0$ and $\mu_L$ are the Fermi energies
at the corresponding ends of the device.

To analyze the performance of the device, we will drop the time dependence and focus
on the steady state solution, since RTDs have an extremely fast response time (on the order
of picoseconds) and we are interested in the long-term behavior (t > 1 sec). The main
quantity of interest in modeling an RTD is the current density, which is the measure of device
performance. The current density can be calculated from the Wigner distribution f(x, k) as

$$j(x) = \frac{h}{2\pi m^*}\int_{-\infty}^{\infty} k\,f(x,k)\,dk. \tag{3.15}$$


4. Discretizing the Wigner-Poisson equations. Since the system of equations specified
in the Wigner-Poisson formulation is too difficult to solve analytically, numerical techniques
are used to approximate the solution. First, the domain of the function is discretized using a
uniform grid. The spatial domain is divided into $N_x$ equally spaced grid points, with $x_1 = 0$
and $x_{N_x} = L$, and the grid spacing defined as $\Delta x = L/(N_x - 1)$.

To discretize the momentum space, the domain is truncated from $(-\infty, \infty)$ to
$(-K_{max}, K_{max})$, where $K_{max}$ is chosen such that $f(x, k) \approx 0$ when $|k| > K_{max}$. We will use
$K_{max} = 0.25$ inverse Angstroms as a best approximation, based on prior work [4, 9]. Then the
momentum domain can be divided into $N_k$ equally spaced grid points, with the grid spacing
given by $\Delta k = 2K_{max}/N_k$. We will avoid using k = 0 as a grid point, since at k = 0 the kinetic
term would be identically equal to zero, making any discretized matrix singular. Thus the
momentum grid points can be defined as

$$k_j = \frac{(2j - N_k - 1)\,\Delta k}{2} \quad \text{for } j = 1, 2, \ldots, N_k, \tag{4.1}$$

which excludes k = 0 whenever $N_k$ is even.


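We will denote a solution of the Wigner distribution function f at a grid point $(x_i, k_j)$ as $f_{ij}$.

To approximate the kinetic term K(f), we will use an upwinding scheme since there are
one-sided boundary conditions on the Wigner function: f is prescribed at x = 0 for k > 0 and
at x = L for k < 0, so the difference stencil is taken toward the inflow boundary. The standard
second order approximation is

$$K(f_{ij}) \approx \begin{cases} -\dfrac{hk_j}{2\pi m^*}\left(\dfrac{3f_{ij} - 4f_{i-1,j} + f_{i-2,j}}{2\Delta x}\right), & k_j > 0 \\[2ex] -\dfrac{hk_j}{2\pi m^*}\left(\dfrac{-3f_{ij} + 4f_{i+1,j} - f_{i+2,j}}{2\Delta x}\right), & k_j < 0. \end{cases} \tag{4.2}$$

A first order Euler approximation is used at $x = \Delta x$ or $x = L - \Delta x$ as appropriate.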

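The potential term P(f) is discretized using the second order composite trapezoidal rule

$$P(f_{ij}) \approx -\frac{4}{h}\sum_{j'=1}^{N_k} f_{ij'}\,T(x_i, k_j - k_{j'})\,w_{j'} \tag{4.3}$$

where the $w_{j'}$ are weighting terms that are defined as:

$$w_{j'} = \begin{cases} \Delta k & \text{for } j' = 2, 3, \ldots, N_k - 1, \\ \dfrac{\Delta k}{2} & \text{for } j' = 1, N_k. \end{cases} \tag{4.4}$$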
For the $T(x_i, k_j - k_{j'})$ term, we must make additional approximations to discretize this
integral. Since the upper limit of the integral in equation (3.6) is infinity, we must truncate
the limit at a value $L_c < L$, where $L_c$ is termed the correlation length. $L_c$ may or may not
correspond to a grid point in the spatial domain, so the weights used in the approximation
for $T(x_i, k_j - k_{j'})$ must take this into account in order for the solution to be second order
accurate. A derivation of these weights can be found in [9]. Thus, the $T(x_i, k_j - k_{j'})$ term can
be discretized as:

$$T(x_i, k_j - k_{j'}) \approx \sum_{i'=1}^{N_c+1} [U(x_i + x_{i'}) - U(x_i - x_{i'})]\,\sin(2x_{i'}(k_j - k_{j'}))\,w_{i'} \tag{4.5}$$

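where the $w_{i'}$ are the modified trapezoidal weights

$$w_{i'} = \begin{cases} \dfrac{\Delta x}{2} & \text{for } i' = 1 \\[1.5ex] \Delta x & \text{for } i' = 2, 3, \ldots, N_c - 1 \\[1.5ex] \dfrac{\Delta x + h_{N_c}}{2} + \dfrac{h_{N_c}\,h_{N_c+1}}{2\Delta x} & \text{for } i' = N_c \\[1.5ex] \dfrac{h_{N_c}^2}{2\Delta x} & \text{for } i' = N_c + 1 \end{cases} \tag{4.6}$$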

To compute the electrostatic potential $u_p(x)$, which is part of the potential energy U(x),
Poisson's equation can be discretized using a central difference formula, for which the second
order method is

$$\frac{u_p(x_{i-1}) - 2u_p(x_i) + u_p(x_{i+1})}{\Delta x^2} = \frac{q^2}{\epsilon}\,[N_d(x_i) - n(x_i)] \tag{4.7}$$

with $u_p(0) = V_0$ and $u_p(L) = V_L$. Within Poisson's equation, the doping density $N_d(x)$ is
piecewise linear, but the electron density is a function of the Wigner distribution f. Thus, it
can be approximated using the composite trapezoid rule

$$n(x_i) \approx \frac{1}{2\pi}\sum_{j=1}^{N_k} f_{ij}\,w_j \tag{4.8}$$

where the weights are the standard composite trapezoidal weights from equation (4.4).

The final term in the Wigner equation is the scattering term S(f). We again discretize
the integrals using the composite trapezoidal rule and the associated weights from equation
(4.4):

$$S(f_{ij}) \approx \frac{1}{\tau}\left[\frac{f_0(x_i, k_j)}{\sum_{j'=1}^{N_k} f_0(x_i, k_{j'})\,w_{j'}}\sum_{j'=1}^{N_k} f_{ij'}\,w_{j'} - f_{ij}\right] \tag{4.9}$$

Once the Wigner function has been computed, the current density j(x) can be calculated

$$j(x_i) \approx \frac{h}{2\pi m^*}\sum_{j=1}^{N_k} k_j\,f_{ij}\,w_j \tag{4.10}$$

where the weights are again those for the standard composite trapezoidal rule.

5. Numerical Implementation. The steady-state form of the Wigner equation (3.12), W(f) = 0, is a
nonlinear equation that can be solved using Newton's method, which seeks a fixed point of the
iteration $u_{i+1} = u_i - W'(u_i)^{-1}W(u_i)$, where $W'(u_i)$ is the Jacobian of W(u) at $u_i$. When
$\|u_{i+1} - u_i\| < \epsilon$, where $\epsilon$ is an error tolerance, $u_{i+1}$ is taken as an approximate
solution of the Wigner equation.

However, forming the iteration step $W'(u_i)^{-1}W(u_i)$ for the discretized Wigner equation
is not computationally efficient, particularly when the matrix $W'(u_i)$ is large (typical
implementations have on the order of 125,000 to 1 million grid points, which can put the Jacobian
matrix at approximately 1 trillion elements). Instead, we will use an inexact Newton method
to solve the system. (For more information on inexact Newton methods, see [7] and references
therein.) With inexact Newton methods, the Newton step s is computed only approximately, so that
it satisfies

$$\|W'(u_i)s + W(u_i)\| \le \eta_i\,\|W(u_i)\| \tag{5.1}$$

where $\eta_i$ is a forcing term that controls the size of the relative residual. Newton's method
is used to compute the $u_i$, and the step s is approximated by a linear iterative method. For
Wigner-Poisson, we will use GMRES to solve for the step s, since GMRES is a Krylov subspace
method for solving nonsymmetric systems that does not require the computation of an
iteration matrix [7].

This method can be efficiently implemented using Sandia's Trilinos software, which has
a variety of built-in inexact Newton method solvers, including Newton-GMRES. In addition,
Trilinos allows the option to use a matrix-free implementation involving finite differences
to approximate the Jacobian-vector product $W'(u_i)s$, since computing and discretizing the
Jacobian of the Wigner-Poisson formulation would be extremely complicated.
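The idea behind a matrix-free Jacobian-vector product can be illustrated with a forward difference, together with a check of the inexact Newton condition (5.1). This is a generic sketch, not the Trilinos/NOX implementation: the function names are hypothetical and the fixed perturbation `delta` is a simplifying assumption (production solvers typically scale the perturbation with the sizes of u and s).

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

// Forward-difference approximation of the Jacobian-vector product:
//   W'(u) s  ~  (W(u + delta*s) - W(u)) / delta
// No Jacobian matrix is ever formed; only two evaluations of W are needed.
Vec jacobian_vector_product(const std::function<Vec(const Vec&)>& W,
                            const Vec& u, const Vec& s, double delta = 1e-7) {
    Vec u_pert(u);
    for (std::size_t i = 0; i < u.size(); ++i) u_pert[i] += delta * s[i];
    const Vec Wu = W(u);
    const Vec Wp = W(u_pert);
    Vec out(Wu.size());
    for (std::size_t i = 0; i < Wu.size(); ++i) out[i] = (Wp[i] - Wu[i]) / delta;
    return out;
}

// Euclidean norm.
double norm2(const Vec& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

// Inexact Newton condition (5.1): ||W'(u)s + W(u)|| <= eta * ||W(u)||,
// given Js = W'(u)s and Wu = W(u).
bool satisfies_inexact_newton(const Vec& Js, const Vec& Wu, double eta) {
    Vec r(Wu.size());
    for (std::size_t i = 0; i < Wu.size(); ++i) r[i] = Js[i] + Wu[i];
    return norm2(r) <= eta * norm2(Wu);
}
```

A Krylov method such as GMRES needs only products of this kind, which is why it pairs naturally with the matrix-free option.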

6. C++ Class Descriptions. Since Trilinos is written in C++, the main program code
is also written in C++ to take advantage of the object oriented structure. A number of classes
and structs were created, each representing a different piece of the Wigner-Poisson equations.
Each class has a Compute function that either calculates the appropriate action on the Wigner
function f or computes terms that will be used in calculating f. The two classes that
involve finite difference approximations, for the kinetic term and the Poisson term, each have
an Operator function that sets up the sparse matrix associated with the derivative term in the
equation.

Data Inputs Three structs and two classes hold the constants associated with the Wigner-Poisson
formulation. Note that the magnitudes of the data in each are dependent on the units
used.

The three structs are Constants, MatProps, and DevProps, which contain standard
mathematical constants, constants related to the materials used, and constants related to the
device, respectively. Most of the C++ classes reference one or more of the structs as part
of their Compute method. The two classes, Barrier and Doping, create the barrier and
doping profiles that are used in the solution of the potential term P(f). Their Compute methods
calculate the relative energy band profile $\Delta_c(x)$ and the doping profile $N_d(x)$ at each grid point
along the length of the device, which are used in equations (4.5) and (4.7) respectively.

Wigner classes The next set of classes were created to calculate the specific terms in the
Wigner equation:

• KineticMethod for the kinetic term K(f)

• ScatterMethod for the scattering term S(f)

• PotentialMethod for the potential term P(f)

The Operator method of the KineticMethod class creates the sparse matrix associated
with the discretization of the derivative term in equation (4.2), and the Compute method applies
the matrix to the Wigner distribution function f. ScatterMethod calculates the result
of equation (4.9), and PotentialMethod calculates the result of equation (4.3) for a given
f. However, to compute the Wigner potential term, T(x, k - k') must be known for all values
of k'. Thus, several more classes were created to break up the computation of T(x, k - k') into
more manageable pieces. These pieces must be computed before the Compute method in the
PotentialMethod class is executed.

Potential  term  classes  The  classes  that  feed  into  the  calculation  of  the  Wigner  potential 
term  are: 

•  TcMethod  and  TpMethod  to  calculate  each  respective  portion  of  the  T  integral 

•  SinMat  to  calculate  the  sin(2 yk)  terms 

•  PoissonMethod  to  calculate  the  electrostatic  potential  up(x)  used  in  the  TpMethod 
class 

•  EleDensMethod  to  calculate  the  electron  density  function  n(x )  used  in  the 
PoissonMethod  class 

Since the potential energy U(x) in the computation of T(x, k - k') is the sum of two
functions, we can break T(x, k - k') into two separate integrals and assign a C++ class to
each:

$$T(x_i, k_j - k_{j'}) = \sum_{i'=1}^{N_c+1} [U(x_i + x_{i'}) - U(x_i - x_{i'})]\,\sin(2x_{i'}(k_j - k_{j'}))\,w_{i'} \tag{6.1}$$

$$= \sum_{i'=1}^{N_c+1} [u_p(x_i + x_{i'}) - u_p(x_i - x_{i'})]\,\sin(2x_{i'}(k_j - k_{j'}))\,w_{i'} + \sum_{i'=1}^{N_c+1} [\Delta_c(x_i + x_{i'}) - \Delta_c(x_i - x_{i'})]\,\sin(2x_{i'}(k_j - k_{j'}))\,w_{i'} \tag{6.2}$$

$$= T_p(x_i, k_j - k_{j'}) + T_c(x_i, k_j - k_{j'}) \tag{6.3}$$

where $T_p(x, k - k')$ represents the integral associated with the electrostatic potential $u_p(x)$
and $T_c(x, k - k')$ the integral associated with the energy band function $\Delta_c(x)$. The associated
C++ classes are TpMethod and TcMethod respectively. The values of sin(2yk) for each
$(y, k) \in [0, L] \times (-2K_{max}, 2K_{max})$ are computed and stored by the SinMat class, and then used
to compute the integrals evaluated in the Compute methods of TcMethod and TpMethod.
Due to the fixed nature of the inputs to TcMethod, $T_c(x, k - k')$ is calculated once and stored
for use throughout the remaining computations. However, $T_p(x, k - k')$ depends upon $u_p(x)$,
which in turn depends on the electron density n(x), so additional classes were created to
compute these functions.

The PoissonMethod class has an Operator method which creates the sparse matrix associated
with the second order central difference approximation in equation (4.7), and the
associated Compute method solves the linear system Ax = b. Here A is the sparse matrix and
b includes the piecewise continuous doping profile $N_d(x)$ and the electron density n(x), which
is computed via the EleDensMethod class. The Compute method uses Trilinos' Amesos
package to solve the system via a serial direct solver called KLU.

Finally,  once  the  Wigner  distribution  function  has  been  fully  calculated,  a  separate  class 
CurrentMethod  computes  the  current  density  using  equation  (4.10). 

7. Solution Process using Trilinos. To solve the Wigner-Poisson system using Trilinos,
the matrices and vectors used in the computations are stored in Epetra data structures,
such as Epetra_CrsMatrix and Epetra_Vector, and the equations are solved using the
nonlinear solver package NOX and the associated continuation package LOCA. A Problem
class is created to call the Compute methods for each class involved in the continuation
process, and a ProblemInterface class is created to implement the
LOCA::Epetra::Interface::Required interface. The kinetic and potential terms (and, when the
ending voltage $V_L \neq V_0$, the scattering term) are then combined to solve for the Wigner
function f(x, k). The flowchart depicted in figure 7.1 represents these processes, where the
classes above the dotted black line are called in the constructor, and the classes below the line
are called in the ComputeF method.


Fig. 7.1. Flowchart of the detailed object oriented solution process. The classes above the dotted black line are
computed in the ProblemInterface constructor, and those below are part of the continuation. The dark grey shaded
boxes indicate inputs that are subject to change via the continuation process.


The system of equations will be solved as two homotopy problems. The first involves
continuation on the barrier profile to calculate the initial Wigner distribution $f_0$. Continuation
is started using a barrier profile $\Delta_c(x) = 0$ and increases up to the full $\Delta_c(x)$. The second
continuation run is to calculate the current density as a function of the bias voltage applied to
the device, which starts at 0 and increases up to $V_L$. These two processes are depicted in the
second flowchart in figure 7.2.

Fig. 7.2. High level flowcharts for the continuation processes. The top chart describes the continuation on the
barrier profile to compute the initial Wigner distribution $f_0$; the bottom chart is the continuation process to compute
the full voltage sweep from $V_0 = 0$ to $V_L$.


To start the computation, an initial guess for f must be chosen, taking the boundary
conditions on the Wigner equation into account. Then equation (3.3) is solved with $\Delta_c(x) = 0$
and $V_L = V_0 = 0$. The barrier profile is increased by multiplying $\Delta_c(x)$ by a constant and
then increasing the constant incrementally from 0 to 1, solving for f at each incremental step.
This is done via Trilinos by calling LOCA. The final Wigner solution f using the full barrier
profile $\Delta_c(x)$ is the initial Wigner distribution $f_0$.
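The barrier continuation described above follows the usual natural-parameter pattern: step the multiplier from 0 to 1 and reuse each converged solution as the initial guess for the next solve. The sketch below shows that pattern on a scalar toy equation. It does not use the LOCA API; the function names, the scalar Newton solver, and the fixed step count are all assumptions made for illustration.

```cpp
#include <cmath>
#include <functional>

using Fn = std::function<double(double, double)>;  // (u, lambda) -> value

// Scalar Newton iteration for g(u, lambda) = 0 at a fixed lambda.
double newton_solve(const Fn& g, const Fn& dg, double u, double lambda,
                    double tol = 1e-12, int max_iter = 50) {
    for (int it = 0; it < max_iter; ++it) {
        const double step = g(u, lambda) / dg(u, lambda);
        u -= step;
        if (std::fabs(step) < tol) break;
    }
    return u;
}

// Natural-parameter continuation: raise lambda from 0 to 1 in n_steps equal
// increments, warm-starting each solve from the previous solution.
double continue_in_parameter(const Fn& g, const Fn& dg, double u_start, int n_steps) {
    double u = u_start;
    for (int s = 0; s <= n_steps; ++s) {
        const double lambda = static_cast<double>(s) / n_steps;
        u = newton_solve(g, dg, u, lambda);
    }
    return u;
}
```

For example, with the toy residual $g(u, \lambda) = u + \lambda u^3 - 1$, the problem at $\lambda = 0$ is linear and easy, and the solution is carried step by step to the fully nonlinear problem at $\lambda = 1$, in the same way the barrier multiplier carries the zero-barrier solution to $f_0$.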

Once $f_0$ has been computed, continuation is performed again, this time by including S(f)
in the Wigner computation and solving equation (3.12) for f using the ending bias value as
the continuation parameter. Once each f is calculated, the current density can be computed
and stored along with the associated voltage for use in producing a current vs. voltage curve.
The voltage is then increased up to the final voltage value $V_L$.

8. Conclusions. The C++/Trilinos implementation defined above is designed to provide more
computational efficiency as well as greater flexibility in modeling a variety of devices than
previous models do. Although other implementations exist in both Matlab and Fortran, Trilinos
is written in C++ to handle memory allocation and computations more efficiently, which will
allow finer meshes to be simulated in shorter run times than those in previous work [12, 17].
Coupling Trilinos with a C++ implementation of the Wigner-Poisson formulation is expected to be
more efficient than the previous Fortran/Trilinos implementation [9, 10], and Trilinos' ability to
handle parallel computation will decrease run times even more.

The  object  oriented  design  of  the  C++  implementation  gives  flexibility  for  investigating 
new  algorithms.  For  example,  changing  the  discretization  of  the  Poisson  equation  is  a  local 
change  in  one  class.  Similarly,  tradeoffs  between  memory  usage  and  FLOPS,  such  as  storing 
the  entire  sin(2yk)  matrix,  can  be  easily  investigated  with  a  local  change  to  one  class. 

Also, due to the flexibility inherent in the data input structures (such as the Barrier
class and the DevProps struct), changes to the device can be made easily and consistently.
Previous models hard-coded most of this information [9, 10, 12], such as the number of
barriers and the material properties, and thus limited the types of devices that could be analyzed.
Therefore, the C++/Trilinos implementation will increase the breadth of knowledge about
RTDs and potentially other nanoscale devices that can be modeled using the Wigner-Poisson
formulation.

9. Future Work. Once the code outlined in this paper is complete, there are a variety
of enhancements to be made that will improve performance and increase the amount of
information available on resonant tunneling diodes:

• Incorporate fourth order methods. Currently, most of the numerical approximations
use second order accurate methods since they are slightly easier to code. The
first improvement to be made will be to switch out the second order methods for
fourth order methods, which will require minor changes in the Compute methods
for several classes.

• Incorporate analytic solutions. In this and previous Fortran versions of the RTD
model, numerical approximations are used to compute solutions involving the piecewise
linear terms $\Delta_c(x)$ and $N_d(x)$. Partial solutions to the equations involving
these terms can be computed analytically, and these solutions incorporated into
the Wigner-Poisson formulation. This will entail changes to the TcMethod and the
PoissonMethod.

•  Implement  a  non-uniform  grid.  The  current  uniform  grid  places  a  large  number  of 
grid  points  in  regions  where  the  physics  of  the  problem  changes  very  little  -  mainly 
in  the  doped  areas  on  either  end  of  the  device,  and  away  from  the  k  =  0  line  in  the 
momentum  dimension.  Implementing  a  non-uniform  grid  which  concentrates  the 
grid  points  in  the  areas  with  greater  variety  in  the  Wigner  function  will  refine  the 
solution  in  those  regions  while  decreasing  computation  time. 

•  Solve  the  time  dependent  Wigner  equation.  Although  equation  (3.12)  shows  the 
time  dependent  nature  of  the  Wigner  equation,  the  solution  process  outlined  here 
computes  the  steady  state  version  of  the  problem.  We  would  like  to  solve  the  full 
time  dependent  version  to  determine  the  oscillatory  nature  of  the  solution  in  the 
region  of  negative  differential  resistance,  and  to  search  for  bifurcation  values. 
Architecture and Networking

Computational  algorithms  frequently  require  large  numbers  of  reliable  operations  and 
ever  increasing  amounts  of  power.  The  articles  in  this  section  present  novel  architectures 
to  better  utilize  available  resources  and  new  simulation  techniques  to  better  understand  the 
reliability  and  power  requirements  of  future  machines. 

Curry et al. present a new RAID capability for storage in high performance computing
that offloads error correction calculations to a graphics processing unit (GPU). This new
technology outperforms a software based RAID controller and, due to the
commodity nature of the GPU, compares favorably to expensive hardware based controllers.
Van Vorst et al. discuss progress made in the development of a real-time large-scale network
simulation testbed on Cray XT supercomputers. This work promises to provide network
researchers with an environment capable of running large-scale simulations. Ruiz Varela et al.
use simulations to compare various techniques for resilient computations on exascale
platforms. The authors also propose a new simulation method for exploring efficient resilience
mechanisms on future machines. Vaughan et al. discuss the use of root cause analysis to
better understand the cause and source of failures in supercomputing environments. They
demonstrate the effectiveness of these techniques using a combination of system logs and
statistical analysis. Merritt and Pedretti evaluate several techniques for data distribution for
multi-threaded multi-core computers. Based on this evaluation, the authors propose a kernel
level dynamic runtime data distribution system. Thompson et al. discuss extending the
Structural Simulation Toolkit to handle and analyze power consumption data. Additionally, using
results from existing power modeling toolkits, the authors present sample results obtained with
this new capability. Kersey and Rodrigues explore the use of process layers for large-scale discrete
event simulation of computer hardware. Based on a use case study, they propose performance
improvement methods that avoid some of the common costs associated with process
layers. Williams and Rodrigues discuss and motivate the use of the Structural Simulation
Toolkit for reliability simulation in the context of high-performance computing. The authors
also describe the need for reliability simulation and the components that go into constructing
such a simulation.


E.C.  Cyr 
S.S.  Collis 


December  17,  2010 


CSRI Summer Proceedings 2010

A  LIGHTWEIGHT,  GPU-BASED  SOFTWARE  RAID  SYSTEM 

MATTHEW L. CURRY*, H. LEE WARD†, ANTHONY SKJELLUM‡, AND RON BRIGHTWELL§

Abstract. While RAID is the prevailing method of creating reliable secondary storage infrastructure, many
users desire more flexibility than offered by current implementations. Traditionally, RAID capabilities have been
implemented largely in hardware in order to achieve the best performance possible, but hardware RAID has rigid
designs that are costly to change. Software implementations are much more flexible, but software RAID has
historically been viewed as much less capable of high throughput than hardware RAID controllers. This work presents a
system, Gibraltar RAID, that attains high RAID performance by offloading the calculations related to error
correcting codes to GPUs. This paper describes the architecture, performance, and qualities of the system. A comparison to
a well-known software RAID implementation, the md driver included with the Linux operating system, is presented.
While this work is presented in the context of high performance computing, these findings also apply to a general
RAID market.


1.  Introduction.  RAID  (Redundant  Array  of  Independent  Disks)  is  a  technology  that 
allows for groups of disks to be joined in order to be viewed as a single logical block
device [4]. Chen et al. describe RAID as a method of combining several disks in order to have
larger,  single  volumes  that  are  presented  in  a  familiar  way,  improving  the  overall  reliability 
of  the  storage,  and  increasing  the  overall  speed  of  operations  on  the  volume  by  exploiting 
parallel  transfers  from  all  disks.  They  define  RAID  levels  5  and  6,  which  are  described  as 
obtaining  parallelism  between  disks  by  splitting  data  across  disks  in  a  block-cyclic  manner 
using  a  fixed  chunk  size.  They  show  that  transfers  significantly  larger  than  this  chunk  size  can 
exploit  multiple  disks  simultaneously  for  different  parts  of  the  transfer,  while  small  transfers 
that  are  not  close  together  can  also  exercise  several  disks  independently. 
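
The block-cyclic layout described above can be illustrated with a short sketch. The function names, the 64 KB chunk size, and the four-disk array used in the usage note are assumptions made for illustration, and parity placement is ignored.

```python
def locate(offset, chunk_size, n_disks):
    """Map a logical byte offset to (disk, byte offset on that disk)."""
    chunk = offset // chunk_size          # index of the logical chunk
    disk = chunk % n_disks                # chunks rotate over the disks
    row = chunk // n_disks                # full rotations that precede it
    return disk, row * chunk_size + offset % chunk_size


def disks_touched(offset, length, chunk_size, n_disks):
    """Return the set of disks used by a transfer of `length` bytes."""
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    return {c % n_disks for c in range(first, last + 1)}
```

With a 64 KB chunk and four disks, `disks_touched(0, 4 * 65536, 65536, 4)` covers every disk, while a 1000-byte transfer at offset 100 stays on disk 0.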

Fault tolerance is a necessary part of RAID because combining multiple components increases
the average rate of failure of components. RAID 0, the only RAID
level that does not contain redundant information, illustrates the problem. A RAID 0 array composed of n disks
has a Mean Time To Data Loss (MTTDL) of MTTDL = MTTF/n, where MTTF is the mean
time to failure of a single disk. Single disks can have an MTTF on the order of one million
hours [18], yielding an expected lifetime of 114 years. However, even a moderately
sized RAID 0 array of twenty disks then has an MTTDL of about 5.7 years, which can be less than an installation’s desired
lifetime.
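
The arithmetic behind these figures can be checked with a few lines. This is only a worked example of MTTDL = MTTF/n; the function name and the 365-day year are assumptions.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_raid0_years(mttf_hours, n_disks):
    """MTTDL = MTTF / n for an array with no redundancy, in years."""
    return mttf_hours / n_disks / HOURS_PER_YEAR


single = mttdl_raid0_years(1_000_000, 1)
array = mttdl_raid0_years(1_000_000, 20)
print(f"{single:.0f} years, {array:.1f} years")   # 114 years, 5.7 years
```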

Most  RAID  configurations  include  some  level  of  redundancy,  including  RAID  5  and 
RAID  6,  which  contain  parity.  These  parity-based  levels  generally  implement  systems  of 
k  +  m  disks  where  any  m  disks  may  fail  without  losing  data.  RAID  5  and  6  have  fixed  values 
m = 1 and m = 2, respectively. Reliability is incorporated into the array by interleaving m
chunks of parity for every k chunks of data, spreading them evenly over all n = k + m
disks at the same offset to create a stripe. The documents describing RAID [4] specify that
the parity is generated such that any k chunks of data and parity are sufficient to recover the
k original chunks of data. The means of generation is left up to the implementer, with many
specialized  erasure  codes  existing  for  RAID  6  [5,  14].  Reed-Solomon  coding  [16]  can  be 
used  to  perform  k  +  m  codings  for  m  >  2. 
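
For the simplest case, m = 1, the parity chunk can be the bytewise XOR of the data chunks, and any k of the k + 1 chunks recover the rest. The sketch below illustrates that property only; the helper names and sample data are assumptions, and larger m requires a real erasure code such as Reed-Solomon.

```python
from functools import reduce


def xor_chunks(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def recover(stripe, lost):
    """Rebuild stripe[lost] from the other k chunks (data and parity)."""
    return xor_chunks([c for i, c in enumerate(stripe) if i != lost])


data = [b"ABCD", b"EFGH", b"IJKL"]        # k = 3 data chunks
stripe = data + [xor_chunks(data)]        # append the m = 1 parity chunk
```

Calling `recover(stripe, 1)` rebuilds the second data chunk, and `recover(stripe, 3)` rebuilds the parity chunk itself.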

RAID systems for high performance workloads are typically implemented in
hardware, because hardware has historically provided much higher performance in many
situations. RAID performs erasure correcting coding on all data written [4], necessitating
a high-speed implementation of erasure correction. This work aims to achieve greatly improved
software RAID performance by offloading erasure correction operations onto commodity
graphics processing units (GPUs).

The authors have already shown that GPUs are better suited to performing Reed-Solomon
coding than CPUs [6, 7, 8]. This advantage is attributable to a lack of appropriate
x86-64 instructions for performing Reed-Solomon coding [3]. While GPUs also lack these
instructions,  the  memory  architecture  and  the  large  number  of  cores  both  contribute  to  much 
higher  rates  of  coding  and  decoding.  This  work  integrates  the  Gibraltar  library  [7],  a  GPU- 
based  Reed-Solomon  coding  library,  into  a  system  called  Gibraltar  RAID. 

2.  Motivation.  This  work  occupies  an  interesting  space  between  hardware-  and  software- 
based RAID controllers, inheriting attributes from each. The overall goal is to provide increased
flexibility over hardware-based RAID implementations, and increased speed over
software-based implementations. This section details the benefits of a high-performance software
RAID implementation over hardware RAID.

2.1.  Lightweight.  One  reason  for  the  expense  of  high-performance  RAID  controllers 
is the lack of HPC-specific controllers in the marketplace. Instead, HPC customers purchase
products aimed at the larger enterprise customer base. The enterprise market demands a
controller that often has many more features than are strictly needed for HPC, like deduplication [9].
By only implementing the features that are required for high-performance streaming
I/O  (that  is,  storing  data  reliably  and  serving  data  to  a  parallel  file  system),  costs  can  remain 
relatively  low  while  servicing  HPC  (and  similar)  needs  just  as  effectively. 

2.2. Inexpensive. Commodity hardware has impacted HPC for the better, yielding systems
that are much more powerful while maintaining cost-effectiveness. For example, today’s
primary architecture for supercomputers is the cluster of workstations, which is largely created
from commodity components. Initial investigations into this architecture were performed
to field an economical alternative to expensive, proprietary systems [20]. As the commodity
computing market grew, improvements to the processing capacity of home and business computers
were leveraged by new clusters, forming more powerful supercomputers. The result is
that the largest machines are built with commodity processor technology [2].

GPU vendors have been continuing the commodity HPC trend by opening a general-purpose
API to program processors that are usually intended to render graphics for interactive
animation [10]. GPUs have mostly been used to improve on CPU performance, with
several important applications accelerated significantly with GPUs [11]. Applying GPUs to
RAID is unusual because the GPU is intended to replace a non-commodity part: the erasure
correction engines used in RAID controllers. The benefits of large production and continual
improvements for the graphics market allow this application to benefit from improving
GPUs without much effort. This situation is in direct contrast to the improvement of RAID
controllers, which usually require redesign to incorporate improvements in technology.

2.3.  Extended  RAID  Features.  While  a  reduced  set  of  features  is  certainly  a  benefit, 
another  goal  behind  this  project  is  to  provide  exactly  the  features  that  are  required,  including 
those  not  offered  with  less  full-featured  RAID  controllers.  Certain  applications  require  high 
levels  of  reliability  and  data  integrity,  but  RAID  controllers  often  do  not  provide  features 
beyond simple RAID 6. We have integrated two uncommon features into Gibraltar RAID:
extra parity and read verification.

2.3.1. Extra Parity. Recently, much has been written indicating that RAID 6,
which allows for two disks to fail without data loss, will not be sufficient to ensure
data  reliability  in  the  near  future.  A  specific  example  questions  the  validity  of  hard  disk 
manufacturers’  reliability  estimates  of  their  own  disks  [17].  Others  also  have  shown  that 
many  hard  disk  failures  are  correlated,  resulting  in  an  increased  probability  of  array  failure 
once  the  first  disk  has  failed  [13].  Further,  the  growing  discrepancy  between  disk  size  and 
disk  speed  is  increasing  the  chances  of  disk  failure  during  array  reconstruction  once  a  disk 
has  already  failed. 

An additional problem is encountering an unrecoverable read error (URE) during reconstruction
after redundancy has been eliminated from the array. UREs can occur for many
reasons, including media damage, a reduction of magnetic coating on the platters, or component
wear, and result in data loss for the affected sector. A RAID array operating without any
failed disks can recover from this loss, but an array that lacks redundancy cannot. For example,
if a RAID 6 array with two failed disks attempts to rebuild onto clean disks, but receives
a  URE  from  one  of  the  remaining  disks,  the  contents  of  the  affected  sector  are  completely 
lost.  If  the  two  corresponding  sectors  on  the  failed  disks  are  used  for  data  blocks,  then  those 
sectors  are  not  recoverable. 

UREs have not historically been a great concern, but the typical hard disk size has increased
while the URE rate has not decreased. Given a typical SATA disk with a URE rate of
one sector per 10^14 bits [19], a read error can occur (on average) once every 12.5 TB. Many
large RAID arrays are significantly larger than 12.5 TB and are heavily loaded, implying that
such arrays will have a high probability of encountering a URE within the context of a rebuild,
causing data loss.
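
The 12.5 TB figure follows directly from the quoted rate, and the same rate gives a rough estimate of the chance of hitting at least one URE while reading a given amount of data. The sketch below assumes decimal terabytes and independent errors with a constant per-bit probability, which is a simplification; the function names are illustrative.

```python
import math

URE_RATE = 1e-14      # unrecoverable read errors per bit read [19]
TB = 1e12             # decimal terabyte, in bytes


def tb_between_errors(rate=URE_RATE):
    """Average amount of data read per URE, in TB."""
    return 1 / rate / 8 / TB


def p_ure(tb_read, rate=URE_RATE):
    """Probability of at least one URE while reading `tb_read` TB."""
    bits = tb_read * TB * 8
    # 1 - (1 - rate)^bits, computed stably for very small rates.
    return -math.expm1(bits * math.log1p(-rate))
```

Under these assumptions, reading 12.5 TB during a rebuild encounters at least one URE with probability of roughly 63%.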

One  mitigating  strategy  for  UREs  is  scrubbing  [12].  During  scrubbing  of  an  array,  the 
RAID  controller  reads  all  disks  from  beginning  to  end,  attempting  to  detect  UREs  and  parity 
mismatches  as  a  background  process  during  normal  operation  of  a  RAID  array.  When  a  URE 
is  encountered,  the  controller  recovers  the  contents  of  the  sector  with  parity  and  requests  the 
disk  to  rewrite  the  information.  The  disk  will  either  rewrite  the  data  (if  the  sector  is  not 
damaged)  or  remap  the  sector  to  another  within  a  pool  of  spare  sectors.  This  process  is  not 
possible  without  redundancy  in  the  array  to  use  for  data  recovery.  Scrubbing  is  not  the  perfect 
solution:  It  consumes  bandwidth,  reducing  the  bandwidth  available  to  client  applications;  and 
scrubbing  does  not  work  if  parity  in  the  array  is  exhausted  and  the  array  is  rebuilding,  when 
the  array  is  likely  to  encounter  a  URE. 

One  way  to  lessen  the  above  problems  of  UREs,  less  reliable  disks,  and  batch-correlated 
failures is to introduce higher degrees of parity. Current hardware RAID controllers are statically
configured with industry-standard and vendor-specific RAID levels, and do not allow
for arbitrary k + m redundancy within an array. The primary distinguishing feature of Gibraltar
RAID is the ability to use the values of m that users require for their applications, constrained
only by space utilization and performance requirements. An array implemented with the controller
described here can be dynamically tuned to provide the desired amount of parity for
each  installation  to  make  efficient  trade-offs  in  performance,  reliability,  capacity,  and  scale. 

2.3.2. Read Verification. In large computer installations, component failure is unavoidable.
Some of the failures that occur are not easily diagnosable because of the lack of detection
for certain trouble conditions. For example, a faulty disk cable can be responsible for a
large  amount  of  silent  data  corruption  between  a  RAID  controller  and  its  disks.  However,  a 
controller  that  does  not  verify  reads  is  not  capable  of  detecting  the  problems  imposed  by  a 
bad  cable,  creating  mysterious  data  corruption  and  failures  in  an  application. 

Furthermore,  increasingly  small  feature  sizes  and  the  resulting  highly  dense  computer 
installations increase the probability of single events (like an alpha particle or neutron strike)
causing silent corruption of data. While most components can and do include their own
error  correction  between  different  components  to  guard  against  data  corruption,  missing  error 
correction  circuitry  (and  less  robust  types)  can  allow  corrupted  data  to  be  propagated  to  a  user 
application. 

Gibraltar  RAID  effectively  detects  corruption  arising  from  these  problems  by  verifying 
reads  at  high  speeds.  Read  verification  can  be  used  to  effectively  recover  from  single  events, 
detect  intermittent  or  chronic  problems  with  components,  and  continue  operating  in  spite  of 
less severe problems. Additionally, using read verification can correct UREs without requesting
extra data or seriously reducing the speed of operation, as the data required to recover the
sector is initially requested for data verification. This reduces the latency between encountering
a URE and having available data to recover the associated sector.
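
The idea can be sketched for the single-parity case: because the whole stripe, parity included, is read for verification, the data needed to repair a URE is already in memory. The code below is an illustration using XOR parity, not the Reed-Solomon path that Gibraltar RAID uses; `read_stripe` and the use of `None` to mark a chunk lost to a URE are assumptions.

```python
from functools import reduce


def xor_chunks(chunks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def read_stripe(chunks, parity):
    """chunks: list of bytes, or None where the disk reported a URE."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise IOError("more erasures than single parity can repair")
    if missing:
        # Repair from data already read for verification; no extra I/O.
        present = [c for c in chunks if c is not None]
        chunks = list(chunks)
        chunks[missing[0]] = xor_chunks(present + [parity])
    elif xor_chunks(chunks) != parity:
        raise IOError("parity mismatch: silent corruption detected")
    return b"".join(chunks)
```

Note that with a single parity chunk a repaired stripe can no longer be verified, which is one reason extra parity and read verification complement each other.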

High-speed  recovery  of  data  lends  itself  to  read  verification  when  some  disks  within 
an  array  have  failed  and  a  rebuild  has  not  been  completed.  Gibraltar  RAID  supports  this  by 
recovering  more  of  the  data  than  is  strictly  necessary  based  on  the  operating  mode  of  an  array; 
specifically,  during  recovery  of  a  missing  chunk,  a  different  chunk  in  the  same  stripe  that  is 
available  on  disk  is  read  and  also  recovered  in  the  same  recovery  operation  that  regenerates 
the  missing  chunk.  The  recovered  chunk  and  the  read  chunk  can  be  compared  to  ensure 
correctness.  This  process  does  not  work  without  extra  redundancy  in  the  array,  indicating  a 
need  for  enough  parity  to  ensure  that  redundancy  is  always  present. 

3.  Architectural  Description.  Because  a  GPU  is  used  for  coding  (generating  parity) 
and  decoding  (recovering  lost  data),  Gibraltar  RAID  was  implemented  entirely  in  user  space. 
Interactions  between  the  RAID  array  and  network  clients  can  be  managed  through  the  Linux 
SCSI Target Framework [1]. The interface between Gibraltar RAID and the target is general,
allowing Gibraltar RAID to be used with other network storage packages that have user
space  operation.  Gibraltar  RAID  cannot  take  advantage  of  many  facilities  for  I/O  within  the 
operating  system  kernel,  so  many  portions  of  the  underlying  secondary  storage  stack  had  to 
be  reimplemented  within  Gibraltar  RAID  in  user  space.  This  section  details  the  mechanics  of 
the  components  that  have  been  reimplemented,  and  how  the  components  communicate. 

Figure  3.1  has  been  provided  to  aid  with  the  description  following.  Different  data  paths 
are outlined as follows: Writes are denoted with a “w,” reads with an “r,” and victimization
with a “v.” Each interaction is noted with a letter indicating the data path and a number
indicating  the  order  in  which  the  interactions  occur. 

3.1.  Architectural  Overview.  There  are  two  main  restrictions  that  govern  the  overall 
design  of  a  GPU-based  RAID  system.  First  and  foremost,  an  application  using  the  NVIDIA 
CUDA  toolkit  must  run  in  user  space.  No  public,  documented  interfaces  to  the  NVIDIA 
run-time  exist  that  are  available  from  a  driver  that  runs  in  kernel  space.  Given  that  the  iSCSI 
target  used  for  this  project  runs  in  user  space,  the  requirement  that  the  coding  occurs  in  user 
space  presents  little  logistical  difficulty.  However,  if  desired,  a  small  driver  that  passes  block 
requests  to  a  user  space  daemon  is  not  difficult  to  consider.  Regardless,  GPU  operations  must 
be  performed  in  user  space.  To  simplify  design  and  development,  and  to  minimize  user  mode 
to  kernel  mode  transitions,  the  other  components  are  also  located  in  user  space. 

Second,  as  a  user  space  service  that  requires  knowledge  of  the  contents  of  cache  in  order 
to  verify  reads,  Gibraltar  RAID  must  bypass  the  Linux  buffer  cache  when  accessing  disks. 
If  this  is  not  done,  a  user  application  may  request  data  that  is  still  in  the  buffer  cache,  and 
Gibraltar  RAID  will  have  no  means  of  knowing  whether  that  data  came  from  the  disk  or  was 
already in RAM. Parity would then be re-verified, which is expensive compared to simply
serving  the  request  from  the  buffer  cache. 

It is possible to keep a large cache in user space and still use the buffer cache to schedule
I/O to and from disk. However, an application that is required to deal with large amounts of
information in its own cache needs a high degree of control over the memory it is using (and
memory used by the kernel on the application’s behalf) in order to prevent running afoul of
Linux’s optimistic memory allocation. If the system is low on memory and an application
calls malloc(), virtual memory can be allocated to the process in absence of physical memory.
The assumption is that the memory either will not be completely used, or more memory
will become available to support this allocation. For storage applications like RAID, the general
pattern is to write data to disk, then acquire more data from users or other input devices.
If the buffer cache is using a large amount of memory, as it tends to do, it can interfere with
the application’s ability to maintain high bandwidth.

Fig. 3.1. Gibraltar RAID Architecture and Data Flow Diagram

The requirement to bypass the Linux buffer cache necessitates using the O_DIRECT flag
when opening the devices, which allows a user to perform I/O operations directly from memory
in user space rather than through the Linux buffer cache. O_DIRECT unfortunately conflicts
with some CUDA functionality, which includes special memory allocators for optimizing
memory transfers between host memory and GPUs. Such allocators perform operations
like mapping host memory into the GPU, which provides increased overlap between PCI-Express
transfers and computation within the GPU; and preventing memory regions from being paged
out by the virtual memory system, which saves a memory copy when transferring data to
a GPU. Direct I/O operations from memory regions allocated with these allocators fail, so
these  optimizations  are  unavailable.  This  decreases  the  effectiveness  of  the  GPU  significantly 
because  of  the  need  for  increased  host  memory  copies,  but  GPU  operations  still  maintain 
significant  speed  improvements  over  a  CPU-based  software  RAID. 
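
A minimal sketch of direct I/O from user space follows, for Linux only. O_DIRECT generally requires the buffer address, the file offset, and the transfer size to be aligned; the 4096-byte alignment used here is an assumption, since the actual requirement depends on the device and file system. The helper names are illustrative and are not part of Gibraltar RAID.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real value is device dependent


def aligned_buffer(nbytes, block=BLOCK):
    """Return a page-aligned buffer whose size is a multiple of `block`."""
    size = -(-nbytes // block) * block        # round up to a full block
    return mmap.mmap(-1, size)                # anonymous maps are page aligned


def read_direct(path, offset, nbytes):
    """Read bypassing the buffer cache; `offset` must be block aligned."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        buf = aligned_buffer(nbytes)
        got = os.preadv(fd, [buf], offset)    # reads straight into `buf`
        return bytes(buf[:min(got, nbytes)])
    finally:
        os.close(fd)
```

Some file systems, such as tmpfs, reject O_DIRECT, so `read_direct` is best tried against a block device or a file on a disk-backed file system.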

3.2.  System/RAID  Interface.  All  interaction  between  the  target  and  Gibraltar  RAID 
occurs through functions implementing the interfaces for the standard C library calls pread()
and pwrite(). Each function takes a pointer to a user buffer, an offset into the RAID device,
and the length of the data to be copied into or out of the user buffer. Each request is mapped
by these functions to the stripes affected, and asynchronous requests for those stripes are submitted
to the Gibraltar RAID cache. These requests are filled without knowledge of whether
the  stripes  are  present  in  the  cache,  so  the  stripe  can  be  requested  in  one  of  two  ways: 

•  Request  stripe  with  its  full,  verified  contents;  or 

•  Request  a  “clean”  stripe,  which  may  return  an  uninitialized  stripe  and  does  not  incur 
disk  I/O. 

The  second  option  is  useful  if  the  stripe  is  to  be  completely  overwritten  by  a  write  request, 
or  if  it  is  expected  to  be  overwritten.  Client  service  threads,  threads  created  by  the  target  to 
satisfy  read  and  write  requests  from  network  clients,  call  pread  and  pwrite. 
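
The mapping from a request to stripe requests might look like the sketch below. It is a simplification: only writes that cover a stripe completely ask for a "clean" stripe, whereas the system described here may also do so when a stripe is merely expected to be overwritten. The function name and the mode strings are assumptions.

```python
def stripe_requests(offset, length, stripe_size, is_write):
    """Yield (stripe index, mode) for each stripe a request touches."""
    end = offset + length
    first, last = offset // stripe_size, (end - 1) // stripe_size
    for s in range(first, last + 1):
        start, stop = s * stripe_size, (s + 1) * stripe_size
        covered = offset <= start and end >= stop
        # A write that covers the whole stripe needs none of its old contents.
        yield s, "clean" if is_write and covered else "verified"
```

For example, a 250-byte write at offset 50 with 100-byte stripes touches stripes 0 to 2, and only stripe 0 (partially covered) must be fetched with verified contents.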

For reads, additional stripes may also be requested in order to provide read ahead capability
that takes advantage of spatial locality of disk accesses. In the case of streaming reads,
there  is  a  high  probability  that  a  read  for  a  block  of  data  will  be  quickly  followed  by  a  request 
for  the  next  block  of  data.  The  Linux  kernel  buffer  cache  does  this;  read  ahead  is  necessary 
to  get  the  best  performance  out  of  a  storage  system  under  streaming  workloads. 

After  all  read  or  write  requests  have  been  registered,  the  client  service  thread  waits  for 
the  requests  to  be  fulfilled.  In  the  case  of  a  read,  the  routine  passes  each  stripe  to  the  erasure 
coding  component  to  be  verified  or  recovered.  Writes  can  be  recorded  as  incomplete  updates 
to  a  stripe,  anticipating  that  the  whole  stripe  will  eventually  be  overwritten.  If  this  does  not 
happen  before  a  stripe  is  chosen  for  victimization,  the  stripe  must  be  read,  verified,  modified 
with  the  contents  of  the  incomplete  stripe,  updated  with  regenerated  parity,  and  written. 

3.3.  Stripe  Cache.  Gibraltar  RAID  includes  a  stripe  cache  that  operates  asynchronously 
with  the  I/O  thread  and  the  user’s  requests  for  reads  and  writes.  In  the  event  of  a  read  request 
to  the  cache,  the  cache  submits  requests  to  the  I/O  thread  to  read  the  relevant  stripes  from 
disk  in  their  entirety,  including  parity  blocks  for  read  verification.  This  full-stripe  operation 
can  be  viewed  as  a  slight  mandatory  read  ahead. 

Writes  have  a  simple  implementation  in  the  stripe  cache,  as  they  do  not  require  immediate 
disk  operations.  The  cache  is  optimistic;  if  the  write  request  does  not  fill  an  entire  stripe, 
the  cache  assumes  that  the  stripe  will  be  completely  populated  before  being  flushed  to  disk. 
Therefore,  no  reads  are  required  before  a  partial  update  to  a  stripe  is  made.  If  a  stripe  is  in 
the  process  of  being  read,  the  cache  will  force  the  client  thread  to  wait  until  the  read  has  been 
completed. If the stripe has already been read, it will simply be given to the client
thread  to  be  updated  with  the  write  contents.  Otherwise,  a  blank  stripe  is  returned,  and  the 
client  is  responsible  for  maintaining  the  clean/dirty  statistics  related  to  the  writes  requested. 

A  victimizer  thread  occasionally  (because  of  an  external  stimulus  from  the  cache  related 
to  memory  pressure)  deletes  clean  stripes  and  flushes  dirty  stripes.  The  interface  between 
the  victimizer  and  the  cache  is  flexible,  allowing  many  different  types  of  algorithms  to  be 
implemented.  The  default  mode  of  operation  is  the  Least  Recently  Used  (LRU)  caching 
algorithm,  but  higher  quality  algorithms  may  be  used.  Furthermore,  the  victimizer  can  have 
an  internal  timer  used  to  implement  write-behind  caching. 
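
A least-recently-used policy of this kind can be sketched in a few lines. The version below is synchronous and evicts when a capacity limit is exceeded, whereas the victimizer described here runs in its own thread in response to memory pressure; the class name, the `flush` callback, and the capacity parameter are assumptions for illustration.

```python
from collections import OrderedDict


class StripeCache:
    def __init__(self, capacity, flush):
        self.capacity = capacity
        self.flush = flush                    # callback: write stripe to disk
        self.stripes = OrderedDict()          # index -> (data, dirty)

    def put(self, index, data, dirty):
        self.stripes[index] = (data, dirty)
        self.stripes.move_to_end(index)       # mark as most recently used
        while len(self.stripes) > self.capacity:
            self.victimize()

    def get(self, index):
        self.stripes.move_to_end(index)       # raises KeyError if absent
        return self.stripes[index][0]

    def victimize(self):
        index, (data, dirty) = self.stripes.popitem(last=False)
        if dirty:
            self.flush(index, data)           # clean stripes are just dropped
```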

3.4.  I/O  Scheduler.  The  scheduler  receives  requests  from  the  cache  for  reading  stripes, 
and  receives  requests  from  the  victimizer  for  writing  stripes.  The  scheduler  accumulates  these 
requests  in  a  queue  until  it  is  ready  to  service  them.  All  of  the  requests  are  received  as  a  batch 
to  facilitate  combining  of  requests  that  are  adjacent  on  disk.  Only  write  requests  are  combined 
for  the  following  reasons: 

• Combining reads increases the latency seen by clients, which must wait longer for a read
request to be serviced. Reads are necessarily synchronous, so a large combined read request will
force all clients to wait until the entire combined request is serviced.

• Our experiments show that performing short contiguous reads easily obtains good
performance from a disk. Writes, however, can be significantly slower (depending
on the qualities of the system, such as hardware and configuration) unless a comparatively
large amount of data is accumulated for writing at once. Figure 3.2 demonstrates
this property. Notice that contiguous writes of 16 megabytes are slightly
slower than contiguous reads of 64 kilobytes. This performance degradation is only
apparent for files or devices opened with the O_DIRECT flag, which bypasses the
Linux kernel buffer cache. For normal disk operation, write requests are combined
in the buffer cache, hiding this potential performance issue.

Fig. 3.2. Performance of a single disk in a RS-1600-F4-SBD switched JBOD over 4Gbps Fibre Channel

The  scheduler  receives  all  waiting  requests  as  a  batch,  ordering  and  combining  them  as 
necessary  to  achieve  the  highest  possible  disk  bandwidth.  The  ordering  algorithm  currently 
used  is  the  circular  elevator  algorithm  (C-SCAN). 
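
The ordering step of C-SCAN can be sketched as follows: requests at or beyond the current head position are served in ascending order, after which the head wraps around and serves the remaining requests, again in ascending order. The function name and the representation of requests as plain offsets are assumptions.

```python
def c_scan(offsets, head):
    """Order request offsets for one sweep upward from `head`, then wrap."""
    ahead = sorted(o for o in offsets if o >= head)
    behind = sorted(o for o in offsets if o < head)
    return ahead + behind
```

With the head at offset 35, the batch `[70, 10, 40, 90, 20]` is served as 40, 70, 90, 10, 20.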

Combining  requests  is  conceptually  simple.  If  there  are  two  writes  that  are  adjacent 
on  disk,  it  is  possible  to  use  a  vector  write  in  order  to  combine  the  requests  into  a  single 
write call. One implementation of this type of call available to applications is pwritev(), a
vector version of pwrite(). Vectored operations allow a contiguous area of a disk to be the
target for a combined write operation from non-contiguous areas of memory. However, using
pwritev() is not the best strategy for a RAID implementation.
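
Although a synchronous vector write is not what the system ultimately relies on, the combining step itself is easy to illustrate with it. The sketch below merges writes that are adjacent on disk and issues one `os.pwritev` call per contiguous run; `combine` and `submit` are illustrative names, and short writes are not handled.

```python
import os


def combine(writes):
    """Merge (offset, buffer) writes that are adjacent on disk."""
    merged = []
    for offset, buf in sorted(writes, key=lambda w: w[0]):
        if merged and merged[-1][0] + sum(map(len, merged[-1][1])) == offset:
            merged[-1][1].append(buf)         # extends the previous run
        else:
            merged.append((offset, [buf]))
    return merged


def submit(fd, writes):
    """Issue one vectored write per contiguous run of requests."""
    for offset, buffers in combine(writes):
        os.pwritev(fd, buffers, offset)
```

The buffers in each run remain separate objects in memory; only their destination on disk is contiguous.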

Asynchronous  I/O,  which  allows  the  Linux  kernel  to  manage  read  and  write  requests 
in the background, is more sensible for storage-intensive applications. Initially, this implementation
used the pthreads library to perform synchronous reads and writes with one thread
assigned  per  disk.  Switching  to  asynchronous  reads  and  writes  allowed  for  more  efficient  use 
of  resources  than  CPU-intensive  pthread  condition  variables  with  a  high  thread-to-core  ratio 
allow.  While  asynchronous  I/O  has  been  implemented  in  the  Linux  kernel  and  C  libraries  for 
some  time,  the  methods  for  performing  asynchronous  vector  I/O  are  not  well-documented. 

There is a method of using io_submit() to submit iovec structures in an unconventional
and difficult-to-discover way. The typical usage of io_submit() takes an array of
iocb structures as a parameter, which describes individual I/O operations to submit asynchronously.
However, to use the relatively new vectored read and write capabilities, one
passes an array of iov structures within the iocb structure instead of iocb_common structures.
These will be used to perform a vectored I/O operation. This is not noted in system
documentation.

Fig. 3.3. Performance of GPU for Coding, 1MB Chunk Size. (a) RAID 6 (m = 2). (b) RAID-TP (m = 3).


3.5.  I/O  Notifier.  The  I/O  Notifier  is  a  thread  that  collects  events  resulting  from  the 
asynchronous I/O calls, and performs record-keeping on a per-stripe basis. Once all of the
asynchronous  I/O  calls  for  a  stripe  have  been  completed,  the  notifier  notifies  other  threads 
that  depend  on  the  stripe  associated  with  the  I/O.  If  the  stripe  is  undergoing  eviction  from  the 
cache,  this  thread  will  initiate  destruction  of  the  stripe  at  I/O  completion. 
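
The per-stripe record-keeping can be pictured as a counter of outstanding operations that signals waiters when it reaches zero. The class below is a sketch under that assumption; its name and fields are hypothetical and do not reflect the actual Gibraltar RAID data structures.

```python
import threading


class StripeIO:
    """Tracks outstanding asynchronous operations for one stripe."""

    def __init__(self, pending):
        self.pending = pending                # I/O calls still in flight
        self.lock = threading.Lock()
        self.done = threading.Event()

    def complete_one(self):
        """Called by the notifier for each completion event."""
        with self.lock:
            self.pending -= 1
            if self.pending == 0:
                self.done.set()               # wake threads waiting on stripe
```

A client thread that depends on the stripe would block on `done.wait()` until the notifier has recorded every completion.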

3.6.  Victim  Cache.  When  considering  the  speed  that  new  I/O  requests  can  arrive  at 
the  RAID  controller,  there  is  a  significant  delay  between  the  decision  to  victimize  a  dirty 
stripe  and  the  completion  of  the  write  associated  with  it.  Canceling  a  write  in  progress  is  an 
inefficient  action  because  of  the  combining  of  writes  and  the  asynchronous  completion  of  the 
writes.  To  aid  in  victimization,  a  victim  cache  is  included  that  allows  for  a  client  read  or  write 
request  to  “rescue”  a  stripe  from  being  deleted  before  it  has  been  written,  or  even  while  the 
write  is  still  in  progress.  Maintaining  a  separate  cache  can  allow  for  many  flush  requests  to 
be  in  flight  simultaneously  without  overloading  the  hash  table  for  the  main  cache.  A  separate 
cache  also  allows  fewer  look-up  operations  and,  thus,  locking,  during  the  victimization  of  a 
stripe. 

3.7. Erasure Coding Component. The erasure coding component uses the Gibraltar
library, which was designed to perform Reed-Solomon encoding and decoding at high rates
using GPUs [7]. Briefly, the Gibraltar library operates by accepting k buffers of data, such
as striped data for a RAID array, and returning m buffers of parity. This GPU-based parity
calculation  can  encode  and  decode  at  speeds  of  well  over  four  gigabytes  per  second  for  RAID 
6  workloads  on  commodity  GPUs.  Figure  3.3(a)  shows  RAID  6  performance  over  a  variety 
of  stripe  sizes  as  compared  to  a  CPU  implementation  of  Reed-Solomon  coding,  Jerasure  [15]. 
A  unique  feature  of  Gibraltar  RAID  is  the  ability  to  extend  far  beyond  RAID  6  in  the  number 
of  disks  that  may  fail  without  data  loss.  Figure  3.3(b)  shows  performance  for  RAID  TP,  a 
triple-parity  RAID  level  that  can  tolerate  three  disk  failures.  These  tests  were  performed  with 
a  GeForce  285  and  an  Intel  Extreme  Edition  965.  The  most  compelling  feature  is  that  the 
GeForce 285, though costing approximately 60% less than the Intel processor, demonstrated
significantly higher performance.

GPU  parity  calculations  entail  transfer  of  significant  amounts  of  data  across  the  PCI- 
Express  bus  to  the  GPU.  This  implies  that  using  the  Gibraltar  library  in  this  system  also  incurs 
significant  PCI-Express  traffic.  This  traffic  can  be  a  significant  concern  if  other  hardware,  like 
network  adapters  and  host  bus  adapters,  also  heavily  uses  the  PCI-Express  bus. 

4. Performance. In order to test the performance of Gibraltar RAID, an experiment has
been constructed to measure the speed of streaming raw I/O to a RAID array managed by
Linux md, the software RAID driver included with the Linux operating system, and Gibraltar
RAID. All measurements were taken without use of the iSCSI target, instead targeting the
base devices. With both reads and writes, requests are submitted to the array starting with
offset zero, with an I/O size of one megabyte ($2^{20}$ bytes). The operations are continued for
100 gigabytes ($100 \times 2^{30}$ bytes). All configurations are tested in normal mode (i.e., no disks
failed) and all degraded modes (i.e., at least one disk unavailable) supported by the RAID
levels tested. Each test has been run three times, and the maximum bandwidths are reported.
For cases with high variation, minimum bandwidths are also reported.

Performance    Linux md, RAID 6    Gibraltar, RAID 6    Gibraltar, RAID TP
Write          612                 637 (min: 438)       727 (min: 589)
Read           878                 774                  720
Write (sf)     369                 669 (min: 470)       743 (min: 438)
Read (sf)      262                 795                  721
Write (df)     369                 602 (min: 501)       759 (min: 462)
Read (df)      253                 789                  742
Write (tf)     N/A                 N/A                  649 (min: 489)
Read (tf)      N/A                 N/A                  767

sf: Single Failure, df: Double Failure, tf: Triple Failure

Table 4.1
Performance of Gibraltar RAID compared to Linux md, for streaming operations to raw device. Measured in
MB/s, higher is better.
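
The streaming write procedure described in this section can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which tool produced the measurements, the helper names `stream_write` and `best_and_worst` are hypothetical, and the sketch does not use direct I/O, so on a real system the page cache would influence the result.

```python
import os
import time

MIB = 2**20


def stream_write(path, total_bytes, io_size=MIB):
    """Write total_bytes sequentially from offset zero and return MiB/s."""
    block = bytes(io_size)                     # zero-filled request buffer
    written = 0
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    try:
        while written < total_bytes:
            n = min(io_size, total_bytes - written)
            written += os.write(fd, block[:n])
        os.fsync(fd)                           # include flush time in the measurement
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return written / MIB / elapsed


def best_and_worst(runs, fn, *args):
    """Repeat a test and report (maximum, minimum) bandwidth, as in Table 4.1."""
    rates = [fn(*args) for _ in range(runs)]
    return max(rates), min(rates)
```

A call such as `best_and_worst(3, stream_write, "/path/to/device", 100 * 2**30)` would mirror the three-run, 100-gigabyte procedure; the device path is a placeholder.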

Each  array  was  configured  to  have  16  disks  of  data  capacity,  with  the  addition  of  the 
necessary  number  of  disks  to  support  the  RAID  level  implemented  (i.e.,  two  extra  disks  for 
RAID  6,  and  three  extra  disks  for  RAID  TP).  A  chunk  size  of  64KB  was  used.  With  Linux 
md,  the  array  was  built  with  the  “assume-clean”  option  because  md  would  otherwise  attempt 
to  calculate  all  parity  blocks  for  the  array.  While  the  data  in  the  md  array  will  not  be  consistent 
at  first,  the  inconsistency  does  not  cause  any  difficulty  with  the  tests  as  executed.  Writes  are 
also  performed  prior  to  reads,  so  all  data  is  consistent  when  read. 

Tests  for  both  Gibraltar  RAID  and  md  were  performed  on  the  same  server.  It  has  two 
quad-core  Intel  Xeon  X5550  CPUs  and  24  GB  of  DDR3  RAM  clocked  at  1333  MHz.  The 
GPU used is an NVIDIA Tesla C1060. Samsung Spinpoint F1 HE103UJ disks are connected
to  an  LSISAS1068E  disk  controller. 

Table  4.1  gives  the  result  of  the  performance  testing.  One  notable  feature  is  that  Gibraltar 
RAID’s  read  bandwidth  is  slightly  lower  than  that  of  md  in  normal  mode.  This  is  expected, 
as  Gibraltar  RAID  performs  read  verification  with  all  available  parity,  which  consumes  more 
bandwidth.  As  the  above  configuration  is  bandwidth-limited  for  reads  by  the  disk  controller, 
the  bandwidth  consumed  by  reading  parity  slightly  reduces  the  performance.  However,  total 
bandwidth  can  be  calculated  for  Gibraltar  RAID  by  dividing  its  rate  by  the  number  of  data 
blocks per stripe and multiplying by the number of total blocks per stripe, yielding comparable
total bandwidths (achieving 99.2% and 97.3% of md’s rate for Gibraltar RAID 6 and
Gibraltar  RAID  TP,  respectively).  This  is  an  important  result:  Parity  verification  does  not 
significantly  reduce  the  speed  of  reads  in  the  above  configurations.  Only  the  extra  bandwidth 
requirement  for  parity  reads  causes  a  performance  decrease. 
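
The total-bandwidth comparison above can be reproduced from the normal-mode read rates in Table 4.1. The sketch below assumes, as the text implies, that md reads only the 16 data chunks per stripe in normal mode while Gibraltar RAID also reads every parity chunk; the function name is hypothetical.

```python
DATA_DISKS = 16  # data capacity of each array in the experiment


def total_bandwidth(data_rate_mb_s, parity_disks, data_disks=DATA_DISKS):
    """Scale a user-visible read rate up to the rate actually pulled from disk."""
    return data_rate_mb_s / data_disks * (data_disks + parity_disks)


md_read = 878.0                                        # Linux md, RAID 6, normal-mode read
raid6_total = total_bandwidth(774.0, parity_disks=2)   # Gibraltar RAID 6
raidtp_total = total_bandwidth(720.0, parity_disks=3)  # Gibraltar RAID TP

for name, total in (("RAID 6", raid6_total), ("RAID TP", raidtp_total)):
    print(f"Gibraltar {name}: {total:.1f} MB/s from disk, {total / md_read:.1%} of md")
```

Both ratios come out at roughly 99% and 97% of the md rate, which matches the figures quoted above to within rounding.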

One  notable  trend  is  that  the  variability  between  results  in  the  write  test  on  Gibraltar 
RAID  could  be  quite  high,  with  up  to  a  40%  difference  between  runs.  Read  bandwidth  tests 
had  almost  no  variability.  As  this  is  an  early  prototype  that  focuses  first  on  functionality  over 
performance, some variations are to be expected. We estimate that the performance variation
results  from  a  potential  serialization  based  on  memory  usage  constraints.  While  the  cache  is 
filling,  the  amount  of  free  memory  available  to  the  cache  decreases.  When  the  victimizer  is 
awakened  by  the  cache  as  it  runs  out  of  memory,  the  nearly  full  cache  will  cause  the  writes 
accepted  into  the  system  to  slow  to  the  rate  of  disk  and  GPU  coding  occurring  serially,  as 
this  is  the  rate  at  which  memory  is  freed  for  new  requests.  A  batch  must  proceed  through  the 
GPU  parity  generation  component  and  be  written  out  to  disk  before  a  new  set  of  writes  can 
be  accepted  into  the  memory  that  has  just  been  released.  In  many  situations,  this  condition 
does  not  occur  and  writes  proceed  at  the  same  speed  as  reads;  however,  this  is  not  always  the 
case,  and  the  tests  reflect  this  undesired  behavior.  This  can  be  considered  a  lesson  learned, 
and  an  opportunity  for  future  improvement. 

Degraded mode performance for Gibraltar RAID is significantly better than that of md, showing
no  performance  reduction  over  normal  mode.  In  fact,  performance  typically  improves  in 
degraded  mode  because  less  data  is  read  for  parity  verification.  In  comparison,  md  suffers 
a  performance  decrease  of  70%  for  reads.  Our  previous  work  in  developing  the  Gibraltar 
library  focused  on  making  coding  and  decoding  similar  operations,  allowing  degraded  mode 
to  be  as  fast  as  normal  mode  [7]. 

5.  Conclusion.  High-performance  software  RAID  with  commodity  hardware  assist  has 
the  potential  to  provide  an  economical  means  of  high-speed  storage  for  a  multitude  of  uses. 
Its flexibility enables applications that even many hardware RAID controllers are not
capable of servicing. For example, environments that are particularly harsh and
space-constrained can apply the extra parity and read verification
to provide much more reliable storage in the face of difficult conditions.

However, the work presented in this paper does not require extraordinary conditions to
show benefits to users. High-performance streaming I/O is a basic requirement of many
applications, but typical software RAID and inexpensive hardware RAID controllers may not
be capable of supporting that level of performance. Many applications only require
high-performance streaming I/O and data reliability, but historically only expensive RAID
controllers have provided both capabilities. Gibraltar, when paired with an economically priced
GPU, can also fill that niche.

This  work  demonstrates  that  parity  generation  and  data  recovery  after  disk  failure  can 
occur  at  line  speed  for  many  disk  installations  by  using  a  GPU.  Further,  as  Gibraltar  can 
verify  data  at  high  speeds,  users  are  protected  from  a  whole  class  of  errors  that  cannot  be 
prevented  without  similar  measures  in  other  controllers.  As  many  software  RAID  packages 
do  not  verify  reads,  the  type  of  technology  presented  here  would  be  a  natural  upgrade.  This 
is  a  significant  point  because  hardware  RAID  controllers  with  this  capability  are  much  more 
expensive  than  the  GPU  required  to  perform  these  computations.  Furthermore,  hardware 
RAID  controllers  that  do  support  read  verification  are  rare. 

While  this  work  currently  uses  GPUs,  GPUs  are  not  the  only  type  of  device  that  can  be 
used to calculate parity. As technology evolves, different computation devices will be
preferred for computation of parity based on analysis of cost versus performance. The Gibraltar
library, the underlying library for Reed-Solomon coding and decoding for Gibraltar RAID, is
flexible enough to allow for different (or multiple) devices to be tasked with parity
computation. The changing computational landscape, with a multitude of accelerators and CPU types
available,  will  be  well-served  in  the  future  by  a  user  space  RAID  infrastructure.  We  intend  to 
test  Gibraltar  with  a  variety  of  other  computation  devices  and  methods. 

6.  Future  Work.  It  has  been  shown  that  GPUs  are  fully  capable  of  sustaining  parity 
generation and data reconstruction beyond the rates sustainable by the software RAID
implementations available for use within this study. However, previously mentioned issues such
as memory fragmentation, pinning/mapping of memory, and direct I/O have made parity
operations much slower than they could be. In the future, we plan to investigate methods to
solve  such  memory  problems  within  the  controller.  Extra  computational  capacity  obtained 
from  such  improvements  could  be  used  to  provide  other  I/O-related  services  or  service  other 
applications  entirely. 

While the tests in this paper cover the speed of the raw device, it is important to
determine the achievable performance through the means of access. This RAID infrastructure is
currently  implemented  as  an  easily  integrated  modification  to  the  Linux  Target  Framework, 
which  allows  network  access  to  storage  via  iSCSI,  iSER,  FCoE,  and  other  protocols.  The 
current  task  with  top  priority  as  of  this  writing  is  to  integrate  machines  running  this  software 
into clusters with high-speed interconnects. We can then perform multi-client tests over a
variety of protocols. Further access options that can be provided include a small driver allowing
direct  access  to  the  array  as  a  block  device. 

SUPER-SCALE REAL-TIME NETWORK SIMULATION ON THE CRAY XT

Abstract. The success of the Internet in recent decades is undeniable. However, it is equally clear that the
Internet  has  ossified  to  the  point  where  even  small  changes  to  its  structure  are  very  difficult  to  implement.  Many 
research  efforts  have  been  funded  in  recent  years  with  the  purpose  of  fostering  network  innovations.  GENI  is  a 
prime  example  of  the  tremendous  push  of  academia  and  industry  for  building  a  comprehensive  experimentation 
platform  where  structural  changes  to  a  future  Internet  can  be  tested  in  large-scale  settings.  In  this  report,  we  present 
our progress in building a large-scale real-time network simulation testbed on Cray XT supercomputers. Specifically,
we  detail  our  efforts  on  improving  the  scalability  of  PRIME  and  creating  a  testbed  environment  on  the  Cray  XT 
architecture.  Our  testbed  promises  network  researchers  an  environment  capable  of  running  experiments  at  orders  of 
magnitude  larger  than  any  current  testbed. 

1.  Introduction.  The  success  of  the  Internet  depends  on  continuing  innovations  in  the 
networking  research  community  and  on  the  successful  transition  of  laboratory  research 
projects  into  products  and  services.  However,  the  Internet  is  an  operational  environment 
with  distributed  ownership.  This  makes  the  Internet  a  difficult  place  to  effectively  prototype, 
deploy, and evaluate new designs or services [14]. As such, experimental testbeds are an
indispensable tool for network researchers.

We can generally classify available experimental networking research testbeds into
physical, emulation, and simulation testbeds. While physical and emulation testbeds provide the
ability to conduct extensive live experiments, they are not completely satisfactory on all
accounts. Their performance and scale are inherently bound by the physical resources. For
example, PlanetLab currently supports thousands of experiments; the facility is heavily loaded and
is already stretching the capability of its underlying resources [16]. Another potential issue
is  the  lack  of  controllability.  Although  emulation  allows  flexible  network  experiments  (e.g., 
[2]),  emulated  network  conditions  are  subject  to  the  physical  setup  of  the  testbeds  (e.g.,  nodal 
processing  speed,  link  bandwidth  and  latency,  and  traffic  situation). 

Network simulation tools, in contrast, allow examining large-scale phenomena more
expediently. However, simulation fares poorly in other aspects, particularly in its operational
realism.  Additionally,  model  development  is  labor  intensive  and  error-prone.  Reproducing 
realistic  network  topologies,  representative  network  traffic,  and  diverse  operational  conditions 
in  simulation  is  known  to  be  a  substantial  undertaking  [6].  Despite  this,  network  simulation  is 
effective  at  capturing  high-level  design  issues  and  providing  preliminary  analysis  of  complex 
system  behaviors. 

Emulab  [19]  and  GENI  [8]  are  currently  the  state  of  the  art  in  emulation  network  testbeds. 
These  testbeds  are  designed  as  shared  resources  where  many  experiments  are  concurrently 
run. Users of these testbeds can only expect to gain access to a small fraction of the
resources at any given time because of this multiplexing. Most experiments are of moderate
size, consisting of dozens or maybe hundreds of compute nodes in a single experiment.

In this work we aim to create an experimental testbed that combines the advantages
of both simulation and emulation and that is capable of utilizing the resources of the Cray
XT architecture. Systems such as the Jaguar supercomputer at ORNL and the RedStorm
supercomputer at SNL offer environments with tens of thousands of compute nodes connected
by high speed interconnects. Furthermore, these systems offer far more computing power per
compute node when compared with the state of the art clusters mentioned above. We have
chosen to use the Parallel Real-Time Immersive Modeling Environment (PRIME) [1] as our
real-time network simulator. PRIME is used because of its ability to execute network models
in parallel on both distributed and shared memory architectures.

Studying computer networks and network applications at large scale is critical to the
design and implementation of next generation Internet technologies. The combination of
operating system virtualization, network simulation, and emulation technologies on Cray XT
supercomputing hardware opens the possibility of executing networking experiments that
comprise hundreds of thousands of network hosts and routers. This allows the study of
networking at scale and in a fully controlled environment.

Our  contributions  can  be  summarized  as  follows: 

•  We  identify  three  areas  that  can  limit  the  scalability  of  PRIME  on  HPC  systems  and 
present  solutions  for  two  of  these  issues. 

•  We  present  a  framework  for  running  a  real-time  simulation  testbed  on  the  Cray  XT 
architecture. 

The remainder of this report is structured as follows. In section 2 we describe the
concept of real-time network simulation and the work required to make it scale to thousands of
compute nodes. In section 3 we discuss issues we encountered and progress that has been
made in running OpenVZ on the Cray XT architecture. Lastly, in section 4 we summarize
our work and detail our on-going efforts.

2. Addressing Issues of Scalability in Real-Time Network Simulation. Real-time
network simulation allows unmodified applications to interact with simulated models in real
time. We use the Parallel Real-Time Immersive Modeling Environment (PRIME) [1] as our
real-time network simulator. PRIME is built on top of the Scalable Simulation Framework
(SSF) [5], a standard API for discrete event simulators. PRIME is both multi-threaded
and distributed. Pthreads are used to run PRIME as a multi-threaded application on
shared-memory multi-processors and MPI is used to run PRIME on distributed-memory machines.
The Cray XT architecture is constructed of many compute nodes, each with one or more
multi-core CPUs connected by a high speed interconnect. On systems such as this, PRIME
runs as a multi-threaded application on each compute node. Across compute nodes, it uses
MPI to coordinate the simulation among the individual simulation instances.

To  run  an  experiment  using  PRIME,  the  experimenter  must  develop  a  network  model. 
The  network  model  is  then  partitioned  into  N  partitions,  where  N  is  the  number  of  compute 
nodes  on  which  the  experiment  will  run.  An  instance  of  PRIME  will  run  on  each  compute 
node  and  execute  the  portion  of  the  network  model  described  in  its  associated  partition.  In 
addition  to  executing  its  portion  of  the  network  model,  each  simulation  instance  is  responsible 
for  importing  and  exporting  network  traffic  with  emulated  applications  running  on  the  virtual 
hosts or routers. Emulated applications run unmodified on the compute nodes. Traffic
generated by the emulated applications is captured by the simulator instance and is injected into the
virtual network where the traffic is subjected to the conditions of the virtual network.
Likewise, when traffic in the virtual network is targeted at an emulated application, the simulator
exports  and  delivers  the  packet. 

Each compute node may run many emulated applications. The primary reason each
compute node will run multiple emulated applications is that the network model is typically so
large that even on a very large machine, the number of emulated applications exceeds the
number of compute nodes. Each application expects to be run in a standard, unshared
environment. PRIME opts to use operating system virtualization techniques to shield the
applications from the fact that many emulated applications are multiplexed onto one compute node.
Virtual nodes that are running emulated applications will be assigned their own virtual
environments. Because many virtual nodes may be assigned to one compute node, a lightweight
virtualization platform must be used. For this project we have chosen OpenVZ [13] as the
platform. OpenVZ uses containers to virtualize system resources. A container is an isolated
environment in which applications can run as if they were running on a dedicated system.
Each container has its own set of processes, file systems, users (including root), network
interfaces and routing tables. Hardware can either be virtualized and shared between
containers, or a single container can be given exclusive access. An OpenVZ kernel hosts multiple
containers at once and provides a mechanism to control how much CPU, RAM, and disk
space each container is allowed to consume.

Nathanael Van Vorst, Kevin Pedretti, and Ron Oldfield

Fig. 2.1. PRIME architecture. A virtual network is mapped onto a grid of scaling units where each unit
simulates a portion of the virtual network. Hosts H1 and H2 are running emulated applications that are run in separate
OpenVZ containers.

The high level real-time simulation infrastructure for PRIME is shown in Figure 2.1. The
infrastructure is composed of scaling units. A scaling unit is responsible for simulating a
portion of the virtual network and managing a set of emulated applications which are co-hosted
on the same physical node. All network traffic that is produced by an emulated application
is captured by the simulator and injected into the virtual network. Once the traffic is in the
virtual network, it will make its way to its final destination, which may be another emulated or
virtual application. We have identified three areas of concern regarding the scalability of the
PRIME network simulator: (1) routing in the virtual network; (2) the performance of
transferring packets between the simulator and applications; (3) load balancing of the network model
among scaling units.

The remainder of this section is organized as follows. In section 2.1 we discuss our
efforts in addressing the amount of memory required to route in the virtual network. A brief
discussion of our initial efforts at addressing the performance of transferring packets between
the simulator and applications is found in section 2.2. Finally, in section 2.3 we discuss issues
with load balancing the execution of network models across compute nodes.

2.1. Routing. The current implementation of PRIME requires that a static forwarding
table for all hosts or routers in the model exist on each scaling unit. The space complexity
of this table is approximately $O(N^2)$, where $N$ is the number of hosts or router nodes in the
network model. As the number of scaling units increases, so does the amount of memory
required to store the routing tables on each scaling unit. Assuming each scaling unit can
manage about $10^2$ hosts and/or routers, we find that when there are $10^4$ scaling units the
entire model consists of about $10^6$ nodes. We also see that there would be $10^{12}$ routing
entries per scaling unit. Further assuming each routing entry took just a single byte, it would
still take over 931 gigabytes per scaling unit just to store routing information for the model.
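
The arithmetic behind this estimate is short enough to check directly. The sketch below uses only the assumptions stated in the paragraph above (about 100 hosts or routers per scaling unit, 10,000 scaling units, one byte per entry); the variable names are illustrative, and a gigabyte is taken to be $2^{30}$ bytes.

```python
hosts_per_unit = 10**2     # hosts and/or routers managed by one scaling unit
scaling_units = 10**4      # number of scaling units (compute nodes)
bytes_per_entry = 1        # deliberately optimistic size of one routing entry

nodes = hosts_per_unit * scaling_units          # N, total nodes in the model
entries = nodes**2                              # flat table: one entry per (source, destination)
table_gib = entries * bytes_per_entry / 2**30   # memory needed on every scaling unit

print(f"N = {nodes:,} nodes, {entries:,} entries, {table_gib:,.1f} GiB per scaling unit")
```

Because every scaling unit holds the full table, adding scaling units increases the per-unit memory requirement instead of spreading it out.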

There were two major problems with the previous structure of routing tables in PRIME.
As mentioned previously, the first problem is that even though a network model may be
partitioned across many compute nodes, each compute node must have access to the global
forwarding table (which stores the routing information). For a shared memory machine this
issue is moot since one forwarding table could be shared among all of the compute nodes.
However, for a distributed memory machine the issue is quite severe. The second problem is
that the space complexity of the forwarding tables was $O(N^2)$. We have devised a new strategy
for routing within network simulations called spherical routing. Spherical routing addresses
both of these issues.

Spherical routing pre-calculates the forwarding tables and conducts routing within
so-called routing spheres, each with a user-specified routing strategy. A routing sphere is defined
internally as a network graph - with vertices and edges - on which a single routing strategy
is  applied.  A  routing  strategy  can  be  either  based  on  shortest  paths,  which  is  commonly  used 
for  intra-domain  routing  (such  as  OSPF  and  RIP),  or  based  on  routing  policies  (such  as  those 
used  in  BGP  for  inter-domain  routing).  Using  spherical  routing,  a  network  can  be  viewed  as 
a  hierarchy  of  routing  spheres:  a  routing  sphere  of  a  sub-network  is  enclosed  by  the  routing 
sphere  of  its  parent  network;  the  routing  sphere  of  the  sub-network  is  represented  as  a  vertex 
in  the  network  graph  of  the  parent  network’s  routing  sphere.  Within  each  sphere  a  static 
forwarding  table  is  calculated  according  to  its  routing  strategy  before  simulation,  and  during 
simulation  the  simulator  conducts  packet  forwarding  according  to  the  forwarding  table  and 
the  location  of  the  routing  sphere  with  respect  to  its  parent  network’s  routing  sphere. 
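
As an illustration of pre-calculating a static forwarding table for a single shortest-path routing sphere, the sketch below runs a breadth-first search from every vertex and records the first hop toward each destination. It is not PRIME's implementation: the function name is hypothetical, hop count stands in for whatever link metric the simulator uses, and the small star topology (four hosts attached to one router) is an assumed example.

```python
from collections import deque


def build_forwarding_table(graph):
    """Return table[src][dst] = next hop on a fewest-hop path from src to dst."""
    table = {}
    for src in graph:
        first_hop = {}             # destination -> neighbor of src to forward to
        visited = {src}
        queue = deque()
        for nbr in graph[src]:     # direct neighbors are their own first hop
            if nbr not in visited:
                visited.add(nbr)
                first_hop[nbr] = nbr
                queue.append(nbr)
        while queue:               # farther vertices inherit the first hop of their BFS parent
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in visited:
                    visited.add(nbr)
                    first_hop[nbr] = first_hop[node]
                    queue.append(nbr)
        table[src] = first_hop
    return table


# Assumed sphere: hosts H1..H4, each linked to router R3.
sphere_graph = {
    "H1": ["R3"], "H2": ["R3"], "H3": ["R3"], "H4": ["R3"],
    "R3": ["H1", "H2", "H3", "H4"],
}
forwarding = build_forwarding_table(sphere_graph)
```

In a hierarchy of spheres, a child sphere would appear as a single vertex in its parent's graph, so the same table construction could be applied at each level.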

The structure of spherical routing can be explained using the following example. The
user specifies a network model in a hierarchical fashion, where networks serve as containers
for routers, hosts, links and sub-networks. Fig. 2.2(a) shows a network consisting of two
sub-networks with the same network structure (Net1 and Net2), connected by three links (L1,
L2 and L3). Because Net1 and Net2 have an identical structure, the simulator simply marks
Net2 as a replica of Net1 in its internal tree representation of the network, as shown on the
right of Figure 2.2(b). Net1 consists of two sub-networks (Net3 and Net4), one router (R1),
and two links (L4 and L5). Again, Net4 is simply a replica of Net3. The latter consists of
four identical hosts (H1 to H4), one router (R3), and four links (L6 to L9). H2, H3 and H4 are
replicas of H1. Hosts and routers also serve as containers for network interface cards (NICs);
we ignore them in the figure for simplicity.

For each network in the model the user can assign a routing sphere. A sub-network can
also inherit the routing sphere of its parent network. For network entities within a network, the
routing sphere is called the owning routing sphere. Each routing sphere must specify a routing
strategy, which is based either on shortest paths or on routing policies. In the example, all
routing spheres are assigned a shortest-path routing strategy. Net1 and Net2 are not specified with
routing spheres and their network graphs are flattened into the network graph of their owning
routing sphere (topnet). Net3 through Net6 each has its own routing sphere specified for
shortest-path routing, although the routing spheres for Net4 through Net6 are actually replicas
of that of Net3.

Fig. 2.2. An example network is shown in (a) and (b) shows the internal representation of the same model with
routing spheres and replicated substructures.

If a sub-network is assigned a routing sphere, the sub-network is represented as a
“super-node” in the network graph of the routing sphere for the parent network. The routing sphere
for the parent network is called a parent routing sphere; the routing sphere for the sub-network
is called a child routing sphere. Like routers, a super-node can have multiple network
interfaces. Unlike routers, the distance between the interfaces on the same super-node is not
always zero, since the interfaces may actually belong to different hosts or routers. In this case,
the distance should be calculated based on the network topology and the associated routing
strategy of the child routing sphere. The example defines five routing spheres, S1 through S5,
as shown in the left of Figure 2.2(b). Because of replication, there are actually only two types
of routing spheres, topnet and S2; S2 through S5 are replicas of each other.

Due to space considerations we omit the details of our algorithm to route packets across
and between spheres at run-time. A detailed study of spherical routing is presented in [17].
The important point in terms of this report is that once a packet is within a routing sphere,
the simulator is able to route the packet across the sphere using only information from its
forwarding tables and the table of its parent. This means that we do not need to include all
forwarding tables in all compute nodes (as was our previous approach). For example, if we
had three compute nodes, we would partition the model into three partitions: P1, P2, and
P3. Assume that P1 includes Net3 and R1, P2 includes Net5 and R2, and P3 includes Net4
and Net6. All of the partitions would need to include S1 because our forwarding algorithm
requires that a routing sphere be able to query its parent sphere. P1 would need to include
sphere S2, and P2 would need to include S4. P3 would need to include spheres S3 and S5,
but since S3 and S5 are replicas of each other we need only include one of them.
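
The partition example above can be sketched as a small lookup that collects, for each partition, the forwarding tables it must hold. The data structures and the function name are hypothetical and do not reflect PRIME's internals. Two assumptions are made explicit in the comments: the top-level sphere S1 is the parent of S2 through S5, and the table of S2 is the one stored copy that its replicas S3 through S5 share. Routers R1 and R2 belong to networks flattened into S1, so they add nothing beyond S1.

```python
# Assumption: S1 (topnet) is the parent of every other sphere in the example.
PARENT = {"S1": None, "S2": "S1", "S3": "S1", "S4": "S1", "S5": "S1"}
# Assumption: S2 holds the stored table; S3, S4 and S5 are replicas that share it.
REPLICA_OF = {"S3": "S2", "S4": "S2", "S5": "S2"}
# Sphere assigned to each sub-network, following the example in the text.
SPHERE_OF_NET = {"Net3": "S2", "Net4": "S3", "Net5": "S4", "Net6": "S5"}


def stored_table(sphere):
    """Map a sphere to the sphere whose forwarding table it actually uses."""
    return REPLICA_OF.get(sphere, sphere)


def tables_needed(nets):
    """Forwarding tables a partition must hold: each sphere's table plus its ancestors'."""
    needed = set()
    for net in nets:
        sphere = SPHERE_OF_NET[net]
        while sphere is not None:
            needed.add(stored_table(sphere))
            sphere = PARENT[sphere]
    return needed


partitions = {"P1": ["Net3"], "P2": ["Net5"], "P3": ["Net4", "Net6"]}
for name, nets in partitions.items():
    print(name, sorted(tables_needed(nets)))
```

Under these assumptions every partition ends up holding only two distinct tables, the top-level table and one shared copy for the replicated spheres, which is the effect of replication described in the text.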

Spherical  routing  addresses  our  first  issue  by  breaking  the  single  large  forwarding  table 
into  a  hierarchy  of  smaller  forwarding  tables.  The  space  complexity  of  the  forwarding  tables 
is  addressed  in  two  ways.  First,  by  breaking  up  the  forwarding  table  into  a  hierarchy,  each 
table  in  the  hierarchy  is  much  smaller  than  the  single  table.  Second,  replicated  routing  spheres 
do not need to store their own copy of the forwarding table because they can share the forwarding
table  of  the  routing  sphere  which  they  replicate. 

There  has  been  much  research  into  reducing  the  space  complexity  of  routing  information 
for  network  simulation.  Of  the  previous  approaches,  hierarchical  routing  [9]  is  the  approach 
which  is  most  similar  to  spherical  routing.  Suppose  a  network  topology  consists  of  a  number 
of domains, a few clusters within each domain, and a few nodes in each cluster. Each node
only maintains routing information for nodes within its local cluster, to each other cluster
in the same domain, and to the other domains in the topology. By storing the information
in this way the average space complexity is $O(kN \log_k N)$, where $k$ is the number of clusters
per domain, and $\log_k N$ is the height of the hierarchy. This provides much better scalability
than a flat forwarding table; however, some routes will be inaccurate in comparison to global
shortest path. These inaccuracies stem from the fact that each cluster or domain can only
have one link that carries traffic between other clusters or domains. Spherical routing has
two  distinct  advantages  over  hierarchical  routing.  First,  spherical  routing  considers  all  links 
between  clusters  and  domains  which  allows  for  a  better  approximation  of  global  shortest 
path  routing.  However,  spherical  routing  does  introduce  some  minimal  path  inflation  when 
routing  across  multiple  spheres.  Second,  spherical  routing  can  take  advantage  of  structural 
similarities  to  reduce  the  space  complexity  of  the  forwarding  tables  far  more  than  hierarchical 
routing  alone. 
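
To make the scaling argument concrete, the short Python sketch below compares the number of entries in a flat forwarding table with the $O(kN \log_k N)$ estimate for hierarchical routing. It is only an illustration: the function names and the sample values of N and k are assumptions made here, constant factors are ignored, and neither spherical routing nor replica sharing is modeled.

```python
import math

def flat_entries(n):
    """Flat routing: each of n nodes keeps one entry per destination."""
    return n * n

def hierarchical_entries(n, k):
    """Order-of-magnitude estimate k * n * log_k(n); constants are ignored."""
    return k * n * math.log(n, k)

# Illustrative sizes only (not measurements from the paper).
for n in (1_000, 100_000, 1_000_000):
    ratio = flat_entries(n) / hierarchical_entries(n, k=10)
    print(f"N={n:>9,}  flat/hierarchical ~ {ratio:,.0f}x")
```

The gap widens with N because the flat table grows quadratically while the hierarchical estimate grows only slightly faster than linearly.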

To  show  the  overall  effectiveness  of  spherical  routing  we  extended  one  of  the  tier-1  ISP 
topologies  provided  by  Rocketfuel  [15]  to  build  a  moderately  large  network  model.  The 
chosen  ISP  contains  640  routers  and  1,382  links.  We  use  a  graph  partitioner  to  partition  the 
ISP  network  into  16  clusters  of  similar  size.  Each  cluster  is  assigned  a  routing  sphere.  We 
then  attach  192  campus  networks  [12]  (each  having  17  routing  spheres)  evenly  distributed 
among  the  clusters,  resulting  in  a  network  model  with  a  total  of  103,935  hosts  and  routers  in 
289  routing  spheres  organized  in  5  levels.  For  spherical  routing,  the  total  memory  consumed 
by  the  forwarding  tables  measures  about  7.3  MB.  Our  previous  method  for  routing  would 
have required approximately 356 GB to store the forwarding tables. These numbers show the
promise of our spherical routing strategy.


Fig. 2.3. The current PRIME I/O infrastructure. Emulated application traffic is captured by TUN/TAP devices
on the host machine and sent to the simulator via the emulation gateway.

2.2.  Packet  Transfer  Performance.  The  second  area  of  concern  is  the  speed  at  which 
the  simulator  can  exchange  packets  with  the  emulated  applications.  The  current  PRIME 
I/O  infrastructure,  seen  in  Figure  2.3,  uses  an  emulation  gateway  to  exchange  packets  with 
emulated  applications.  The  approach  has  the  advantage  that  the  simulation  of  the  virtual 
network  and  the  execution  of  the  emulated  applications  are  fully  decoupled.  However,  the 
overhead  required  to  exchange  packets  between  the  simulator  and  the  emulated  applications 
is quite high. This limits the bandwidth to about 100 MB/s [11].

To  cope  with  this,  we  designed  a  new  API  for  PRIME  which  can  support  a  diverse  set 
of  emulation  strategies.  The  new  API  introduces  the  concept  of  an  emulation  device  which 
abstracts how PRIME imports and exports packets. There are two general categories of
emulation devices. The first is an interrupt-driven device. These devices require
a separate thread and block until a packet is ready to be imported or exported. The second
is a polling device. Polling devices do not require a separate thread for execution;
instead, the device directly schedules when it should be allowed to import and/or export
packets. The scheduling is most often periodic, but our API does not require it.
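
The two device categories can be sketched as follows. This is a minimal Python illustration and not the PRIME API: every class and method name here (EmulationDevice, InterruptDrivenDevice, PollingDevice, ToyScheduler, call_later) is hypothetical, only the import direction is shown, and the toy scheduler merely stands in for the simulator's event queue.

```python
import heapq
import queue
import threading
from abc import ABC, abstractmethod
from collections import deque

class ToyScheduler:
    """Minimal discrete-event scheduler standing in for the simulator kernel."""
    def __init__(self):
        self.now, self._events, self._seq = 0.0, [], 0

    def call_later(self, delay, fn, *args):
        heapq.heappush(self._events, (self.now + delay, self._seq, fn, args))
        self._seq += 1

    def run_until(self, t_end):
        while self._events and self._events[0][0] <= t_end:
            self.now, _, fn, args = heapq.heappop(self._events)
            fn(*args)
        self.now = t_end

class EmulationDevice(ABC):
    """Hides how packets are imported; `inbox` is what the simulator consumes."""
    def __init__(self):
        self.inbox = queue.Queue()

    @abstractmethod
    def start(self, scheduler):
        """Begin importing packets."""

class InterruptDrivenDevice(EmulationDevice):
    """Uses a dedicated thread that blocks until a packet is ready."""
    def __init__(self, blocking_read):
        super().__init__()
        self._read = blocking_read            # blocks; returns None to stop

    def start(self, scheduler=None):
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while (pkt := self._read()) is not None:
            self.inbox.put(pkt)

class PollingDevice(EmulationDevice):
    """Needs no thread: it schedules its own next poll (here, periodically)."""
    def __init__(self, try_read, period):
        super().__init__()
        self._try_read, self._period = try_read, period   # try_read never blocks

    def start(self, scheduler):
        scheduler.call_later(self._period, self._poll, scheduler)

    def _poll(self, scheduler):
        while (pkt := self._try_read()) is not None:
            self.inbox.put(pkt)
        scheduler.call_later(self._period, self._poll, scheduler)

# Usage: a polling device drains two queued packets at its first poll.
source = deque([b"pkt-1", b"pkt-2"])
poller = PollingDevice(lambda: source.popleft() if source else None, period=1.0)
sched = ToyScheduler()
poller.start(sched)
sched.run_until(2.5)
polled = [poller.inbox.get_nowait() for _ in range(poller.inbox.qsize())]
```

The polling variant reschedules itself at a fixed period in this sketch, but nothing in the structure prevents a device from choosing a different delay on each call, which matches the statement that periodic scheduling is common but not required.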

The emulation device abstraction allows us to explore new methods of importing
and exporting packets. Using our new API we wrote an emulation device which uses an
emulation gateway to exchange packets with emulated applications. We are currently
implementing an efficient transport which directly exchanges packets with emulated applications
without the need for a gateway.

2.3.  Load  Balancing.  The  third  issue  is  related  to  the  partitioning  of  the  network  model. 
Currently,  the  model  is  partitioned  into  N  partitions  and  each  partition  is  assigned  to  run  on  a 
single  compute  node  (scaling  unit).  When  partitioning  the  model  we  opt  to  use  Metis  [10]  to 
partition  the  graph.  During  partitioning  we  maximize  the  delay  of  the  links  that  cross  between 
the  partitions  while  trying  to  balance  the  number  of  nodes  in  each  partition.  This  approach 
has many advantages; however, it fails in a number of significant ways. First, it assumes
that  all  of  the  hosts  and  routers  in  the  model  take  a  similar  amount  of  effort  to  simulate. 
This  assumption  is  clearly  not  true.  The  amount  of  work  required  to  simulate  a  partition  is 
not  only  related  to  the  number  of  entities  in  the  partition  but  also  the  amount  of  traffic  that  is 
generated within the partition. A small portion of an experiment’s partitions could easily generate
a majority of its traffic, thereby causing a severe load imbalance. Because each compute node
has a limited amount of resources, the load imbalance may cause severe oversubscription
of the resources on some of the compute nodes. This hinders PRIME’s ability to simulate the
network  in  real-time.  Second,  the  partitioning  does  not  account  for  the  difference  between 
purely  simulated  network  entities  and  emulated  entities.  Emulated  entities  require  more  CPU 
resources  because  an  entire  virtual  environment  must  be  created  and  run  to  host  the  emulated 
applications.  Furthermore,  there  are  limits  as  to  the  number  of  emulated  entities  that  can  exist 
on a single compute node. Due to the limited time for this project, we opted not to address
this  issue.  We  only  mention  it  as  an  open  issue  that  needs  resolving. 

3. A Testbed for Large-Scale Real-Time Simulation on Cray XT. The Cray XT architecture
is a platform used by many DOE laboratories (e.g. SNL, ORNL, ANL) to run
large-scale  parallel  distributed  simulations.  The  PRIME  Real-Time  simulation  infrastructure 
in  Figure  2.1  requires  two  things:  (1)  a  collection  of  compute  nodes  that  are  connected  by  a 
high  speed  network  and  (2)  a  virtualized  environment  to  run  applications.  The  Cray  XT  is  a 
very  good  match  in  terms  of  the  first  requirement.  The  second  requirement  is  where  the  Cray 
XT is not yet adequate. Virtualization is needed for two reasons. First, emulated applications
need to be run in a “standard” environment and should require no modifications to run.
Second, in order to support the execution of network models where there are more applications
than compute nodes, it must be possible to run multiple emulated applications on each
compute node. Virtualization is a clean solution that allows us to run multiple applications in
isolated environments on the same hardware.

The  default  operating  system  for  the  compute  nodes  of  Cray  XT  supercomputers  is  a 
modified  version  of  the  SUSE  SLES  distribution  called  Compute  Node  Linux  [18]  (CNL). 
CNL has many advantages over a vanilla Linux distribution but there are a few serious drawbacks.
First, CNL has been heavily modified and is difficult to compile or modify. Second,
because of the proprietary extensions introduced by Cray, CNL is essentially a closed source
environment. These drawbacks have led to the development of a patch for vanilla Linux
kernels to run on the Cray XT hardware. The patch is primarily composed of an open source
driver for the Cray SeaStar and Resiliency Control Agent (RCA) hardware. The RCA is responsible
for routing hardware events and console output to the login node(s). The SeaStar
driver emulates Ethernet over the SeaStar interconnect. Details of the Ethernet emulation are
discussed in Section 3.1.

The  basic  idea  behind  OpenVZ  is  to  support  separate  process  name  spaces  and  resource 
isolation  and  control.  OpenVZ  is  available  as  a  patch  to  vanilla  Linux  kernels.  The  latest 
patch is for the 2.6.32 kernel and is quite large, currently 132,710 lines. Rather than backporting
the OpenVZ patch to match our Cray XT patch, we ported the Cray patch to the
2.6.32 kernel. The port of the Cray XT patch was straightforward. Modifying the OpenVZ
patch to work with our Cray XT patch was also fairly straightforward. We did, however, run
into difficulties with a number of bugs in the OpenVZ kernel which were exposed by not
using the default set of kernel options.

Modern  Linux  distributions  often  require  hundreds  of  megabytes  of  disk  space  for  even 
the most bare-bones of installs. Compute nodes are run diskless. Their root file system is
an in-memory file system, a so-called initramfs. In order to minimize the amount of memory
that is used for the initramfs we opted to build our own distribution of Linux based on BusyBox [4].
BusyBox is a minimal userland, a set of common Unix utilities, that is commonly used for embedded Linux
devices. BusyBox provides many of the system tools one would expect in a full-fledged distribution
but there are some exceptions. It is, for instance, expected that applications be statically
compiled as the resulting distribution does not come with any pre-compiled dynamic libraries.

In  addition  to  the  kernel  patch,  OpenVZ  requires  a  set  of  tools  to  create  and  manage 
containers. The OpenVZ tool set is a combination of C programs and shell scripts which
coordinate  construction  and  control  of  containers.  The  shell  scripts  make  extensive  use  of 
system  utilities  such  as  awk,  grep,  and  df.  BusyBox  provides  all  of  these  tools  as  a  single 
static  binary  and  wraps  the  binary  with  scripts  to  make  it  appear  as  if  there  were  distinct 
applications.  It  turns  out  that  the  BusyBox  implementation  of  many  of  the  system  utilities  is 
either  incomplete  or  not  strictly  compatible  with  the  standard  (or  common)  implementations. 
We  were  forced  to  rework  most  of  the  scripts  related  to  the  construction  of  containers  in 
response  to  these  incompatibilities. 

OpenVZ  has  a  concept  called  templates.  A  template  is  a  base  install  of  a  specific  Linux 
flavor, including all of the system files for that distribution. There are templates for Fedora,
RedHat, Ubuntu, SUSE, CentOS, as well as many others. When a new container is installed,
the system basically copies the files from a template into a directory for that container. For
each container that is installed, an additional copy of these files must be made. The average
size of a template is around 150 megabytes. It is not feasible to store files for hundreds of
containers in the initramfs; there would be very little memory remaining to run the simulator or
applications. Instead we opted to use a union file system which allows us to store a single
read-only  copy  of  the  system  files  and  separately  store  modifications  each  container  makes 
to  the  read-only  files.  However,  the  union  file  system  alone  is  not  adequate.  The  whole  point 
of  having  containers  is  so  applications  can  be  run  in  the  containers  during  an  experiment. 
The  reason  to  run  experiments  is  to  gather  data.  Applications  in  each  container  will  generate 
data  of  possibly  significant  volumes.  We  needed  somewhere  to  store  the  experimental  data 
generated  by  the  simulator  and  applications.  This  led  us  to  use  a  shared  file  system  to  store 
the  experimental  data.  Section  3.2  details  the  structure  of  the  file  system  for  each  compute 
node. 

3.1.  Networking.  The  SeaStar  is  basically  a  small  switch  with  seven  ports.  One  port  is 
connected  to  a  host  CPU.  The  other  six  ports  are  connected  to  SeaStar  chips  on  other  compute 
nodes.  In  general,  the  compute  nodes  are  connected  in  a  3D  torus.  Each  SeaStar  in  the  torus 
is  loaded  with  a  routing  table.  When  a  message  is  received  from  a  neighboring  SeaStar,  a 
check is made to see if the message is destined for the connected host. If the local host is not
the target, another table lookup is done to determine the port out of which to forward the message.
If  the  message  is  meant  for  the  attached  CPU,  the  SeaStar  raises  an  interrupt  to  the  CPU  and 
sets  up  a  DMA  transfer  to  the  host’s  memory. 

SNL’s  open  source  SeaStar  Ethernet  emulation  driver  allows  the  standard  Linux  TCP/IP 
stack  to  use  the  SeaStar  as  it  would  any  Ethernet  device.  In  essence,  fully  formed  Ethernet 
frames  are  given  to  the  driver  to  be  delivered.  The  driver  re-writes  the  Ethernet  header  with 
a  SeaStar  header  by  mapping  the  MAC  addresses  to  SeaStar  addresses.  It  is  expected  that 
the  MAC  address  of  a  SeaStar  encodes  the  underlying  SeaStar  address.  Additionally,  the 
driver  does  not  support  Ethernet  broadcast.  As  such,  ARP  cannot  be  implemented.  Instead 
of  running  ARP,  the  ARP  tables  are  preloaded  at  boot  with  all  of  the  IPs  and  MACs  of  the 
SeaStars in the torus. Section 3.3 discusses some of the details of how the IPs, MACs, and ARP
tables are configured at boot time.

The  SeaStar  supports  message  sizes  that  are  much  larger  than  the  standard  Ethernet  frame 
size of 1500 bytes. The default MTU for the driver is 16000 bytes (about 16 KB). Because the
SeaStar raises an interrupt for each message (which in this case maps to a frame), the frame
size can have an impact on the performance of the driver. Using the default driver, we were
able to get about 1.86 GB/s between two nodes using iperf’s TCP bandwidth test. A previous
study measured the peak bandwidth of the SeaStar using Portals at 2.2 GB/s [3]. Given the
overheads  of  the  TCP/IP  stack  we  are  pleased  with  the  available  bandwidth. 
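
A rough calculation shows why the frame size matters when every frame raises an interrupt. The Python sketch below is a back-of-the-envelope estimate under assumptions made here: 1 GB is taken as 10^9 bytes, every frame is assumed to be full-sized, and header overhead and any interrupt coalescing are ignored. The function name is illustrative.

```python
def frames_per_second(bandwidth_bytes_per_s, mtu_bytes):
    """Frames (and hence per-frame interrupts) needed to sustain a bandwidth."""
    return bandwidth_bytes_per_s / mtu_bytes

MEASURED_BW = 1.86e9            # bytes/s, the iperf figure quoted above

large_mtu_rate = frames_per_second(MEASURED_BW, 16000)   # driver default MTU
small_mtu_rate = frames_per_second(MEASURED_BW, 1500)    # standard Ethernet MTU

print(f"MTU 16000: {large_mtu_rate:,.0f} frames/s")
print(f"MTU  1500: {small_mtu_rate:,.0f} frames/s")
print(f"ratio: {small_mtu_rate / large_mtu_rate:.1f}x more interrupts at MTU 1500")
```

Under these assumptions, sustaining the same bandwidth with standard 1500-byte frames would require roughly an order of magnitude more interrupts per second than with the 16000-byte default.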

3.2.  File  System.  The  root  file  system  of  the  compute  nodes  is  an  in-memory  file  system 
(initramfs) because the compute nodes do not have attached storage devices. It is therefore
important to keep the initramfs small because any memory used for the initramfs is not available
for applications.

As mentioned previously, the storage required for the OpenVZ containers can be substantial.
To minimize the amount of space required for the containers we decided that we
would  only  support  a  single  type  of  container.  Since  all  containers  will  use  the  same  template 
image,  a  majority  of  the  files  will  be  identical.  If  we  are  able  to  keep  a  single  copy  of  all  of  the 
files  that  are  the  same  between  all  the  containers  we  can  save  a  substantial  amount  of  space. 
We use funionfs [7], a FUSE-based union file system. Funionfs allows two directories to
be “unioned” and presented as a single, mountable file system. One directory is set to be
read-only and the other is set to be read-write. When a file is accessed, if the file exists
in  the  read-write  directory  the  read-write  version  is  used.  If  the  file  is  not  in  the  read-write 
directory,  the  read-only  version  is  returned.  When  read-only  files  are  modified,  the  modified 
version  is  saved  in  the  read-write  directory.  By  using  funionfs  we  are  able  to  store  a  single 
template for all containers, and only store the system files which are modified during the execution
of each of the individual containers. As part of our reworking of the vzctl scripts we
modified  them  to  use  funionfs  when  creating  and  starting  containers. 
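
The lookup and copy-on-write rules described above can be modeled in a few lines of Python. This is a toy model of the semantics only, not funionfs itself or its interface: the UnionView class and the directory names are hypothetical, and file deletion is not modeled.

```python
import tempfile
from pathlib import Path

class UnionView:
    """Toy two-layer union: a shared read-only layer plus a per-container layer."""
    def __init__(self, read_only: Path, read_write: Path):
        self.ro = read_only
        self.rw = read_write

    def resolve(self, rel: str) -> Path:
        """Prefer the read-write copy; fall back to the read-only template."""
        rw_path = self.rw / rel
        if rw_path.exists():
            return rw_path
        ro_path = self.ro / rel
        if ro_path.exists():
            return ro_path
        raise FileNotFoundError(rel)

    def read(self, rel: str) -> str:
        return self.resolve(rel).read_text()

    def write(self, rel: str, text: str) -> None:
        """Every write lands in the read-write layer; the template is untouched."""
        target = self.rw / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)

# Usage: one shared template, one container-private directory.
root = Path(tempfile.mkdtemp())
template, container = root / "template", root / "ct101"
(template / "etc").mkdir(parents=True)
container.mkdir()
(template / "etc" / "hostname").write_text("template\n")

view = UnionView(template, container)
before = view.read("etc/hostname")        # served from the template
view.write("etc/hostname", "ct101\n")     # modified copy goes to the container layer
after = view.read("etc/hostname")         # now served from the container layer
```

Because only modified files are copied, the per-container storage cost is proportional to what each container changes rather than to the size of the template.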

Funionfs  does  not  entirely  solve  the  storage  problem.  Files  that  are  unique  to  a  container 
must be stored separately in the read-write directory. As mentioned earlier, it is expected that
applications  running  within  a  container  will  generate  data.  The  data  should  be  available  to 
the  experimenters  after  an  experiment  so  the  data  must  also  be  persistent.  Typically  a  parallel 
file  system  such  as  Lustre  or  Panasas  is  used.  We  have  three  target  environments  on  which  to 
run our system. The first system, xtp, is an XT5 with around 1024 compute nodes and uses a
Panasas-based shared file system. The second system, rsopen, is a much smaller 32-node XT4
running a Lustre-based shared file system. The end goal is to run PRIME on a very large XT
system such as Jaguar. Jaguar is an XT5 with around 18K compute nodes running a Lustre-based
file system. There are known issues with using simpler shared file systems such as NFS,
so  our  initial  goal  was  to  leverage  one  of  these  parallel  file  systems. 
It turns out that our ambition was greater than our ability. All of the systems have highly
customized installations of either Panasas or Lustre. There was no obvious path to supporting
Panasas or Lustre in one, much less all, of our target environments. However, we are able
to easily re-export a Panasas or Lustre file system using NFS. The issue is that a single NFS
server will be hard-pressed to handle hundreds, let alone thousands, of clients. To
solve  that  issue  we  decided  we  could  use  a  network  of  NFS  servers.  The  exact  number  of 
NFS  servers  will  depend  on  the  particular  environment.  We  plan  on  running  tests  on  our  xtp 
system  to  determine  how  many  clients  an  NFS  server  can  reasonably  support. 

3.3.  Initialization.  Compute  nodes  use  a  customized  initialization  process  during  boot. 
Our  customized  initialization  script  sets  up  the  SeaStar  device,  populates  the  ARP  tables, 
mounts  shared  file  systems,  and  prepares  the  system  to  create  and  run  OpenVZ  containers. 
Each compute node is assigned a node id (nid). The previously mentioned RCA driver exports
the  nid  via  the  proc  file  system  at  /proc/cray_xt/nid.  In  addition  to  the  nid,  there  are 
two  static  files:  /etc/nids  and  /etc/servers.  The  nids  file  contains  the  IP  and  MAC 
addresses  of  all  compute  nodes  in  the  system  and  is  used  to  populate  the  ARP  cache.  The 
local SeaStar IP is calculated as follows: IP = 192.168.(NID/254).(1 + NID mod 254), where
NID/254 denotes integer division. The MAC address is simply the NID (in hex) followed by four
zeros and prepended with enough zeros to make a valid MAC address.
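
The address derivation can be sketched in Python as follows. The sketch reads the formula as IP = 192.168.(NID div 254).(1 + NID mod 254) with integer division, and it reads the MAC rule as twelve hexadecimal digits formed by left-padding "NID in hex followed by 0000" with zeros. Both readings, as well as the function names, are assumptions made here rather than details confirmed by the initialization script.

```python
def seastar_ip(nid):
    """IP address derived from the node id (integer division assumed)."""
    if not 0 <= nid < 254 * 256:
        raise ValueError("nid does not fit in two octets under this scheme")
    return f"192.168.{nid // 254}.{1 + nid % 254}"

def seastar_mac(nid):
    """MAC address: zero padding, then the nid in hex, then four zeros."""
    if nid < 0:
        raise ValueError("nid must be non-negative")
    digits = f"{nid:x}0000"
    if len(digits) > 12:
        raise ValueError("nid too large for a 48-bit MAC under this scheme")
    digits = digits.rjust(12, "0")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))

def arp_entries(nids):
    """(IP, MAC) pairs of the kind a boot script could preload into the ARP cache."""
    return [(seastar_ip(n), seastar_mac(n)) for n in nids]

for ip, mac in arp_entries([0, 253, 254, 1023]):
    print(ip, mac)
```

Keeping the last octet in the range 1 to 254 avoids the .0 and .255 addresses, which is presumably why the formula uses 254 rather than 256.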

After the network is set up, the prime-ovz export is mounted from the NFS server specified
in  the  servers  file.  The  prime-ovz  share  is  expected  to  have  three  directories:  template, 
nodes,  and  exp.  The  base  template  images  are  stored  in  the  template  directory.  Within  the 
nodes directory each compute node creates a directory node_NID, where it stores all of its
experimental  data.  The  exp  directory  is  where  users  place  their  network  models  and  where 
PRIME  will  store  data  generated  by  the  simulator  itself. 

4.  Conclusion.  Over  the  course  of  this  project  we:  (1)  successfully  ported  and  ran 
OpenVZ  on  a  Cray  XT4’s  compute  nodes;  (2)  designed  an  emulation  infrastructure  for  PRIME 
and  started  an  emulation  driver  for  the  Cray  XT;  and  (3)  implemented  and  evaluated  a  new 
routing strategy for large-scale network simulation. In the coming months, we hope to finish
the implementation of our testbed, at which time we plan to design and run a large-scale
network  experiment  on  the  Cray  XT5  machine  at  SNL  (xtp). 
APPLICATION SUPPORT FOR RESILIENCE IN EXASCALE SYSTEMS

MARIA  RUIZ  VARELA*,  KURT  B.  FERREIRA^,  AND  ROLF  RIESEN* 

Abstract. Application-driven, coordinated checkpoint-to-disk is the most prevalent fault tolerance method used
in modern, large-scale, high performance computing (HPC) systems. As the node count increases, the inherent synchronization
of coordinated checkpoint/restart (CR) is expected to seriously limit overall application performance, to
the point where applications would spend most of their time checkpointing instead of executing useful work. As HPC
systems grow in size, unscalable resilience methods and increased failure rates will progressively hinder application
performance, hence the need for evaluating resilience performance at exascale. Since exascale systems are not available
yet, we choose simulation as the performance evaluation method to run experiments at the million-core level. Our
choice of simulation platform, the Structural Simulation Toolkit (SST), is an event-driven framework that integrates
models for different system components, such as processing elements, memory, and the network. To accurately
characterize resilience overhead, we compare the performance of several resilience methods by extracting communication
patterns of representative scientific applications and coding them in a state machine representation suitable for
SST. In this work, we propose a new simulation method that will enable us to identify the most efficient resilience
mechanisms for future exascale systems and their applications.


1. Motivation. Coordinated, application-directed checkpoint-to-disk is widely used
as a resilience method in modern parallel systems to enable applications to progress in spite of
hardware and software faults. Future exascale machines are predicted to have hundreds of
thousands of nodes and millions of cores [1], and the number of cores per chip is expected to
grow with Moore’s law. Additionally, since the system failure rate grows proportionally to
the number of components, exascale machines are expected to experience more failures than
current systems. Consequently, as the node count and number of cores per chip increase,
the failure rate will be such that applications would spend most of their time checkpointing,
restarting  from  failures,  and  recomputing  lost  work,  rather  than  making  progress  towards  the 
solution.  Even  under  ideal  conditions,  applications  executing  on  more  than  50,000  nodes 
could  spend  more  than  half  of  their  total  running  time  on  checkpointing  overhead  [11]. 

Currently there is no evidence indicating that the mean time to failure (MTTF) of the individual
components that comprise a node will improve in the near future. On the contrary, there
is evidence that suggests manufacturing process improvements for future chips will keep
failure rates per bit similar to those of current systems [18]. This expected constant component
failure rate and the dramatic increase in component count will lead to a much lower system
MTTF for next generation systems.

Figure  1.1  depicts  the  run  time  of  a  168-hour,  weak-scaling  application  executing  in  a 
system  with  a  single  node  mean  time  between  failures  (MTBF)  of  43,800  hours  (5  years), 
checkpoint  time  of  5  minutes,  restart  time  of  10  minutes,  and  optimal  checkpoint  interval. 
For node counts beyond 50,000, only a small fraction of the total elapsed time corresponds to
actual  useful  computation  time.  This  rapidly  increasing  overhead  is  caused  by  the  amount  of 
lost  work  and  the  number  of  restarts  the  application  experiences  as  the  failure  rate  grows. 
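
The trend can be reproduced approximately with a first-order model. The Python sketch below uses the parameters quoted above (node MTBF of 43,800 hours, 5-minute checkpoints, 10-minute restarts) and makes several assumptions that the text does not state: node failures are independent, so the system MTBF is the node MTBF divided by the node count; the "optimal checkpoint interval" is taken to be Young's approximation sqrt(2 x checkpoint time x MTBF); and the wasted fraction is estimated to first order as checkpoint time / interval + (interval / 2 + restart time) / MTBF. It is a rough estimate and not the model behind Figure 1.1.

```python
import math

NODE_MTBF_H = 43_800.0      # from the text: 5 years per node
CHECKPOINT_H = 5 / 60       # from the text: 5 minutes
RESTART_H = 10 / 60         # from the text: 10 minutes

def system_mtbf(nodes):
    """Independent, identical nodes: the system fails `nodes` times as often."""
    return NODE_MTBF_H / nodes

def young_interval(mtbf_h, checkpoint_h=CHECKPOINT_H):
    """Young's first-order optimum: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_h * mtbf_h)

def wasted_fraction(nodes):
    """First-order share of time not spent on useful work (capped at 1)."""
    mtbf = system_mtbf(nodes)
    tau = young_interval(mtbf)
    waste = CHECKPOINT_H / tau + (tau / 2 + RESTART_H) / mtbf
    return min(waste, 1.0)

for nodes in (1_000, 10_000, 50_000):
    mtbf_min = system_mtbf(nodes) * 60
    tau_min = young_interval(system_mtbf(nodes)) * 60
    print(f"{nodes:>6} nodes: MTBF {mtbf_min:8.1f} min, "
          f"interval {tau_min:6.1f} min, waste ~{wasted_fraction(nodes):.0%}")
```

The first-order estimate loses accuracy once the system MTBF approaches the checkpoint and restart times, so values near the cap should be read only as "almost no useful work".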

Fig. 1.1. Overhead example for a 168-hour application [11]

In this work, we explain the simulation method we propose to evaluate resilience performance
at extreme scale. In order to run the simulation experiments, we need to design the
different components that need to be integrated into the event-driven SST framework, namely
the network, compute cores, and application communication patterns. The network component
is designed to simulate realistic network traffic. The communication pattern component
we designed to plug into SST generates network traffic without incurring the processing and
memory overhead of executing a full endpoint simulator. The compute core component is
represented by a message pattern generator that transmits messages and implements different
resilience methods. Message patterns are important because the performance of some
resilience mechanisms depends on message exchange behavior. For example, in the case of
uncoordinated checkpointing with message logging, only if we know at which point in time
messages arrive, how large they are, and which core sent them, will we be able to faithfully
emulate message logging, keep track of the log size and logging overhead, and send recovery
requests to the appropriate core. Communication patterns are coded as a state machine, which
is a representation suitable for SST. The state machine for an application pattern will contain
additional states to provide different resilience functionality.

The rest of the paper is organized
as follows. Section 2 provides definitions and a classification of the resilience methods
we will compare in our experiments. In Section 3 we describe the simulation approach we
propose, the components of the simulation framework, and our choice of design parameters
to feed it. In Section 4 we detail the experiment plan, and in Section 5 we provide a current
status of our work and conclude.

2.  Background.  Rollback-recovery  fault  tolerance  protocols  can  be  checkpoint-based 
or log-based. Based on the type of coordination that occurs among processes, checkpoint-based
protocols can be coordinated, uncoordinated, or communication-induced. Based on
the  frequency  with  which  the  message  log  is  saved,  log-based  methods  can  be  pessimistic, 
optimistic,  or  causal  [9]. 

Several  optimizations  to  coordinated  checkpoint/restart  (CR),  designed  to  ameliorate 
some  of  its  negative  effects,  have  been  proposed.  Some  examples  include  incremental  [17] 
[5], non-blocking [8], diskless [16] [10], and RAID-like distributed and multi-level checkpointing [14].

In  coordinated  CR  all  processes  save  their  checkpoint  state  synchronously  creating  a 
consistent global state, which facilitates recovery. Its main drawback is the strain this synchronization
puts on stable storage, making storage system bandwidth the bottleneck of the
operation.  Also,  even  if  only  one  process  fails,  the  entire  application  needs  to  restart  from  the 
last  stable  checkpoint. 

In contrast to coordinated CR, uncoordinated CR only restarts the failed processes. Uncoordinated
CR logs messages at the sender side and, in case of failure, replays the logged
messages necessary to bring the restarted processes to a consistent state with respect to the
rest of the processes. Its main disadvantages are the storage space required to log the messages,
a considerably more complex restart, the need for garbage collection of message logs
as the application execution progresses, and, for optimistic protocols, the possibility of rollback
propagation, which could cause all processes to restart in the worst case. Although it
has  been  proposed,  uncoordinated  CR  has  not  been  implemented  in  HPC  systems. 
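
The core idea of sender-side logging with replay can be illustrated with a small Python sketch. It is a deliberately simplified toy under assumptions made here: a single sender and receiver, in-order deterministic delivery, and in-memory logs. The class and method names are hypothetical, and the distinction between optimistic and pessimistic logging is not modeled.

```python
from collections import defaultdict

class Sender:
    """Keeps a copy of every message it sends, indexed by destination."""
    def __init__(self, rank):
        self.rank = rank
        self.next_seq = defaultdict(int)   # per-destination sequence counter
        self.log = defaultdict(list)       # dest -> [(seq, payload), ...]

    def send(self, dest, payload, deliver):
        seq = self.next_seq[dest]
        self.next_seq[dest] += 1
        self.log[dest].append((seq, payload))   # log before handing off
        deliver(self.rank, seq, payload)

    def replay(self, dest, last_seen, deliver):
        """Resend the messages the restarted receiver lost (seq > last_seen)."""
        for seq, payload in self.log[dest]:
            if seq > last_seen:
                deliver(self.rank, seq, payload)

    def garbage_collect(self, dest, checkpointed_seq):
        """Drop log entries already covered by the receiver's checkpoint."""
        self.log[dest] = [(s, p) for s, p in self.log[dest] if s > checkpointed_seq]

class Receiver:
    def __init__(self):
        self.received = []      # application state: payloads seen so far
        self.last_seq = {}      # sender rank -> highest sequence number delivered
        self._ckpt = ([], {})

    def deliver(self, src, seq, payload):
        self.received.append(payload)
        self.last_seq[src] = seq

    def checkpoint(self):
        self._ckpt = (list(self.received), dict(self.last_seq))

    def restore(self):
        self.received, self.last_seq = list(self._ckpt[0]), dict(self._ckpt[1])

# Usage: the receiver fails after its checkpoint; only it rolls back.
sender, receiver = Sender(rank=0), Receiver()
for word in ["a", "b"]:
    sender.send(1, word, receiver.deliver)
receiver.checkpoint()
for word in ["c", "d"]:
    sender.send(1, word, receiver.deliver)
state_before_failure = list(receiver.received)

receiver.restore()                                              # roll back the failed process
sender.replay(1, receiver.last_seq.get(0, -1), receiver.deliver)  # sender fills the gap
```

The sender never rolls back in this sketch, which is the property that distinguishes the approach from coordinated CR; the price is the log, which grows until the receiver's next checkpoint allows it to be trimmed.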

In addition to the classical rollback-recovery protocols, there are mechanisms such as
redundant computing that enable applications to continue executing if failures occur. Redundant
computing has long been employed in mission-critical [13] [4] [15] and distributed
systems [12] [6], but has been dismissed for HPC systems due to its perceived high cost. In
redundant computing, each process is replicated a number of times across the system. It increases
application resilience by increasing the application mean time between interruptions
(MTBI). An application can make uninterrupted progress towards useful work provided that
at least one replica of each process remains functional. Unlike rollback-recovery protocols,
where a single failure causes an interrupt, a redundant application tolerates multiple failures
until redundancy is exhausted, avoiding recovery overhead. This overhead reduction has the
potential of increasing the application MTBI and checkpoint interval, reducing time to solution
and increasing system throughput. At large node counts, these gains can offset the cost
of extra nodes and the additional power and cooling requirements [11].

Diskless  techniques  have  been  proposed  as  an  optimization  to  coordinated  CR  to  store 
checkpoints more efficiently [7] [16] [20]. One example is a distributed in-memory mechanism
that saves checkpoint state redundantly across a distributed system in a RAID-like manner,
and writes it to stable storage only when a failure occurs [14]. Its main drawback is the nearly
doubled  amount  of  memory  required  to  store  both  the  application  state  and  the  checkpoint 
file.  Its  main  benefit  is  the  ability  to  tolerate  multiple  failures  as  long  as  at  least  one  of  the 
nodes  that  store  the  checkpoint  file  and  its  copy  remains  functional. 

Table 2.1 lists the resilience methods to compare as part of the simulation method proposed
in this work.


Resilience mechanisms
1. Coordinated CR
2. Uncoordinated CR with optimistic message logging
3. Uncoordinated CR with pessimistic message logging
4. Distributed CR with RAID-like storage in remote node memory
5. Redundant computing


Table 2.1. Resilience mechanisms.


3. Approach. Our goal is to learn how the various methods that provide application resilience
perform at exascale. Since these systems are not available yet, we choose simulation
to run experiments at the million-core level. Our choice of simulation platform, the Structural
Simulation Toolkit [2], is an event-driven simulation framework that integrates models
for  different  system  components,  such  as  processor,  memory,  and  the  network.  In  order  to  run 
such  large  simulations,  the  network  and  compute  core  components  integrated  into  SST  need 
to  be  simple  and  frugal  in  memory  consumption. 

Concerning resilience methods, some of them might not scale because of the large number
of aggregate faults expected to occur in exascale systems; thus, we will simulate faults and
force the pattern generators to recover before they can continue. Recovery overhead might
become prohibitively large for some resilience methods at exascale. Furthermore, the overhead
might depend on the message pattern and thus be application dependent. Our simulation
experiments will enable us to evaluate these scenarios by setting various parameters that
configure the message pattern generators, that is, the various delays that govern writing and
retrieving checkpoint and log data, and the duration of computation phases. We will adjust
these parameters to get as close as possible to measurements obtained from modern large-scale
systems executing real applications. The following subsections describe our design choices for
the different SST components.

3.1. Network. We model the network component as a two-dimensional torus interconnect
with end-around connections. A torus interconnect is a mesh network widely
used in HPC systems. In this topology, each processor is connected to four neighbors, which
simplifies the implementation of grid application patterns that exchange messages between
subgrids. The number of rows and the number of columns are powers of two. This design
constraint simplifies the implementation of the collective algorithms. Instead of simulating a
full router, we opt for a simplified model that forwards messages based on the source route
computed at the sending core. Messages are wormhole-routed, i.e. a single message occupies
the input and output ports at a time. The routers model internal congestion at the output
ports by blocking incoming messages; however, there is no flow control among routers, so
congestion that propagates through the network cannot be modeled. For many simulation
studies, it is important to have realistic network traffic, but computation at the endpoints is
of limited relevance. The communication pattern component of SST allows the generation of
network traffic without incurring the processing and memory overhead of running a full
endpoint simulator.
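
As an illustration of source routing on a torus with end-around connections, the following
Python sketch computes a list of output ports at the sending core. It is not the SST
implementation: the dimension-order policy (X first, then Y), the mapping of +x to "east" and
+y to "north", and the function names are assumptions made for this example only.

```python
def ring_step(src, dst, size):
    """Return +1 or -1: the direction of the shorter way around a ring of `size`."""
    forward = (dst - src) % size
    return 1 if forward <= size - forward else -1


def source_route(src, dst, x_dim, y_dim):
    """Return the list of output ports a message takes from src to dst.

    src and dst are (x, y) router coordinates. Assumption: the route resolves
    the X dimension first, then Y, always taking the shorter way around.
    """
    (sx, sy), (dx, dy) = src, dst
    route = []
    while sx != dx:
        step = ring_step(sx, dx, x_dim)
        route.append("east" if step == 1 else "west")
        sx = (sx + step) % x_dim
    while sy != dy:
        step = ring_step(sy, dy, y_dim)
        route.append("north" if step == 1 else "south")
        sy = (sy + step) % y_dim
    route.append("local")  # final hop: deliver to the attached core
    return route
```

Because the whole route is attached to the message when it is sent, intermediate routers only
need to pop the next port name, which is what keeps the router model small.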


3.2. Compute Cores. For now, each node consists of a single core, and each core is
represented by a message pattern generator. The total number of cores as well as the number
of cores per router must be a power of two. As in the network case, this restriction greatly
simplifies the implementation of the collective algorithms. Simulating a full-fledged CPU core
at exascale would be prohibitively expensive. For our purposes, a much faster approach will
suffice; thus, we create message pattern generators that transmit messages and implement the
resilience methods listed in Table 2.1.

3.3. Application Patterns. Application patterns encapsulate the key features of a code and
capture its communication pattern. We implement four message patterns that resemble
the messaging behavior of a set of representative scientific applications: (1) Ghost cell, (2)
Master-Slave, (3) Integer Sort (IS), and (4) Fast Fourier Transform (FT) patterns. In order to
instantiate these patterns as objects in SST, we implement them as state machines, which is a
representation suitable for the event-driven simulation environment. Instead of sending
messages, the pattern generators send events to the appropriate destination core informing it
that a message is on its way, along with the size of that message. Local computation is emulated
by simply waiting for a specified amount of time.

Each pattern generator is implemented as a state machine. The generators simulate compute
time by suspending operations until a future event indicates that the time has elapsed and that
the state machine must transition to another state. The state machines contain states for
waiting for messages, in case the algorithm has a dependency on incoming data. The state
machines also have additional states to enable various checkpoint/restart methods, the handling
of faults, and the recovery after a fault. Currently, the ghost cell pattern is implemented.
Implementation of master-slave, IS, and FT patterns is in progress. In the future, we plan to
extract communication patterns for more complex applications such as structured adaptive mesh
refinement (AMR) codes.
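
The idea of a pattern generator as a state machine can be sketched with a toy event queue.
The Python below is a minimal illustration, not SST code: the class names, the state names
(COMPUTE, WAIT, DONE), and the two-rank exchange are all assumptions chosen to keep the
example short. Compute time is emulated by scheduling a future "compute_done" event, and a
message is emulated by an event delivered to the peer after a fixed delay.

```python
import heapq


class Simulator:
    """Tiny event queue; entries are (time, sequence, target, event name)."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so targets are never compared

    def schedule(self, delay, target, name):
        heapq.heappush(self._queue, (self.now + delay, self._seq, target, name))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, target, name = heapq.heappop(self._queue)
            target.handle(name)


class PatternGenerator:
    """Hypothetical two-state pattern: compute, then wait for the peer's message."""

    def __init__(self, sim, compute_time, msg_delay, iterations):
        self.sim = sim
        self.compute_time = compute_time
        self.msg_delay = msg_delay
        self.iterations = iterations
        self.peer = None
        self.pending = 0      # messages that arrived but were not consumed yet
        self.completed = 0    # finished iterations
        self.state = "INIT"

    def start(self):
        # Emulate local computation by waiting for a future event.
        self.state = "COMPUTE"
        self.sim.schedule(self.compute_time, self, "compute_done")

    def handle(self, name):
        if name == "compute_done":
            # Tell the peer that a message is on its way, then wait for ours.
            self.sim.schedule(self.msg_delay, self.peer, "message")
            self.state = "WAIT"
        elif name == "message":
            self.pending += 1
        if self.state == "WAIT" and self.pending > 0:
            self.pending -= 1
            self.completed += 1
            if self.completed < self.iterations:
                self.start()
            else:
                self.state = "DONE"


def run_pair(compute_time=10, msg_delay=2, iterations=3):
    """Run two generators that exchange one message per iteration."""
    sim = Simulator()
    a = PatternGenerator(sim, compute_time, msg_delay, iterations)
    b = PatternGenerator(sim, compute_time, msg_delay, iterations)
    a.peer, b.peer = b, a
    a.start()
    b.start()
    sim.run()
    return sim.now, a.state, b.state
```

Checkpoint, fault, and recovery behavior would appear in such a design as further states and
event names handled in the same dispatch method.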

3.3.1. Ghost cell pattern. The ghost cell pattern is a type of structured grid computation
where each point in an n-dimensional grid maps to a point in the physical space. The value
of a point is a function of other neighboring points that are updated iteratively. This set of
neighboring points is called a stencil. The specific stencil computation that determines a point
value depends on the application. In our version of the ghost cell pattern, a point waits for
the values of its four adjacent neighbors: east, west, north, and south. In general, however,
the stencil points need not be adjacent.
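
To make the four-neighbor exchange concrete, the following Python sketch computes the ranks a
given rank would wait on in a two-dimensional decomposition with wraparound. The row-major
numbering of ranks and the function name are assumptions for illustration; the paper does not
specify how ranks are mapped to grid positions.

```python
def ghost_neighbors(rank, x_dim, y_dim):
    """Return the east, west, north, and south neighbor ranks of `rank`.

    Assumption: ranks are numbered row by row on an x_dim by y_dim grid,
    and the grid wraps around in both dimensions.
    """
    x, y = rank % x_dim, rank // x_dim
    return {
        "east": y * x_dim + (x + 1) % x_dim,
        "west": y * x_dim + (x - 1) % x_dim,
        "north": ((y + 1) % y_dim) * x_dim + x,
        "south": ((y - 1) % y_dim) * x_dim + x,
    }
```

In one iteration of the pattern, a rank would announce a message to each of these four ranks
and stay in its waiting state until the four matching messages have arrived.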

3.3.2. Master-slave pattern. In the master-slave pattern, one core is designated the
master, which distributes work to slaves running on all other cores. The slave processes do not
communicate with each other, and the amount of work each slave performs varies. The
master is responsible for collecting results from the slaves and assigning new work as slaves
become available.

3.3.3. IS pattern. This pattern is extracted from the Integer Sort (IS) of the NAS
parallel benchmarks [3] and it targets integer computation and communication performance. It
implements a particular parallel bucket sort of normally distributed integers generated by a
random process. The input sequence is partitioned and each interval is assigned to a different
process. In turn, each interval is further divided into buckets to hold integers in that
subinterval. The algorithm comprises three phases: (1) the sequence of integer keys is
partitioned and sorted locally; (2) each participating rank sends those keys that do not belong
to that rank's sorting subinterval, that is, the keys are redistributed to the appropriate
buckets (this phase is communication intensive); (3) the newly received keys are merged with
existing keys in the corresponding rank's buckets. This type of sort has application in particle
method simulations.
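
The three phases can be sketched in a few lines of Python that simulate all ranks in one
process. This is an illustration of the general bucket-sort idea rather than the NAS IS code:
the equal-width key intervals, the function name, and the use of nested lists in place of real
messages are assumptions.

```python
import heapq


def parallel_bucket_sort(local_keys, max_key):
    """Simulate the three-phase sort; local_keys[r] holds the keys of rank r.

    Assumption: rank r owns the r-th equal-width interval of [0, max_key].
    Returns one sorted list per rank.
    """
    nranks = len(local_keys)
    width = -(-(max_key + 1) // nranks)  # ceiling division

    # Phase 1: every rank sorts its own keys.
    sorted_local = [sorted(keys) for keys in local_keys]

    # Phase 2: redistribute keys to their owners (the all-to-all exchange).
    outbox = [[[] for _ in range(nranks)] for _ in range(nranks)]
    for src, keys in enumerate(sorted_local):
        for key in keys:
            outbox[src][key // width].append(key)

    # Phase 3: every rank merges the sorted runs it received.
    return [
        list(heapq.merge(*(outbox[src][dst] for src in range(nranks))))
        for dst in range(nranks)
    ]
```

Concatenating the per-rank results in rank order yields the globally sorted sequence, which is
why only phase 2 requires communication.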

3.3.4. FT pattern. This pattern is extracted from the Fast Fourier Transform (FT) of
the NAS parallel benchmarks and targets long-distance communication performance. It
implements a solution to a 3D partial differential equation using Fast Fourier Transforms of
the kind used in spectral codes.

3.4. Collective Communications. The FT and IS application patterns described in
Section 3.3 perform coordinated communication operations with several processes. In the IS
pattern, processes cooperate to compute the total bucket size and the number of keys each
process has to send, and to send those keys to their respective processes. In the FT pattern,
processes cooperate to transpose a distributed matrix. Collective operations, as their name
indicates, are executed collectively, which means that each process in a process group calls the

IS()
    COMPUTE        // initialize arrays, set timers
    loop MAX_ITERATIONS
        rank( iteration )
    COMPUTE        // gather timing information; verify, print results

rank( iteration )
    COMPUTE        // sort into buckets
    ALLREDUCE      // get bucket size totals
    COMPUTE        // calculate key redistribution
    ALLTOALL       // find out how many keys each process will send
    ALLTOALLV      // send keys to respective processes
    COMPUTE        // calculate total number of keys, local key sort

Fig. 3.2. IS pattern.

FT()
    loop MAX_ITERATIONS
        COMPUTE    // index map, initial conditions, roots-of-unity
                   // evolve to next time step in Fourier space
        ALLTOALL   // transpose planes
        REDUCE     // checksum
    COMPUTE        // verify, print results

Fig. 3.3. FT pattern.


communication routine with the same parameters. As with the communication patterns,
collectives need to be implemented as state machines to integrate them into SST. In order to
keep the state machines for the collective operations as simple as possible, the communication
pattern component requires that the total number of ranks in a system be a power of two. Each
core hosts one rank, represented by a communication pattern component. The components further
assume that the network is an X × Y torus with end-around connections. Each router in that
torus is connected to a node, which consists of an x × y network-on-chip (NoC) torus. Each
router of the NoC has one or more cores attached to it, and each one of these routers
corresponds to a router model component within SST. The network routers use wormhole routing,
i.e. one message occupies an input and output port at a time, while the NoC routers use
flit-based routing to more closely resemble a memory crossbar switch. Following is a description
of the collective operations used in the communication patterns described in Section 3.3.

3.4.1.  Broadcast.  A  broadcast  performs  a  one-to-all  operation  that  sends  data  from  one 
process  to  all  processes.  We  implement  it  as  a  binary  tree,  with  constant  message  length.  In 
the  tree  representation,  ranks  correspond  to  nodes  and  they  can  be  either  the  root,  interior 
nodes,  or  leaf  nodes.  Figure  3.4  shows  the  broadcast  algorithm  and  Figure  3.5  depicts  the  tree 
representation. 
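
A binary-tree broadcast can be sketched as follows. The Python below is an illustration under
one common assumption that the paper does not state: ranks are laid out like a binary heap,
with rank 0 as the root and ranks 2r + 1 and 2r + 2 as the left and right children of rank r.
The function names are hypothetical.

```python
from collections import deque


def tree_links(rank, nranks):
    """Return (parent, children) of `rank` in a heap-ordered binary tree."""
    parent = None if rank == 0 else (rank - 1) // 2
    # Right child listed first, mirroring the send order of the root.
    children = [c for c in (2 * rank + 2, 2 * rank + 1) if c < nranks]
    return parent, children


def bcast_hops(nranks):
    """Return {rank: number of tree hops before the data reaches that rank}."""
    hops = {0: 0}
    frontier = deque([0])
    while frontier:
        rank = frontier.popleft()
        _, children = tree_links(rank, nranks)
        for child in children:
            hops[child] = hops[rank] + 1  # child waits on its parent, then forwards
            frontier.append(child)
    return hops
```

With this layout a rank is a leaf when it has no children, an interior node when it has both a
parent and children, and the data reaches every rank after a number of hops that grows with
the logarithm of the number of ranks.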

3.4.2. Reduce. Reduce performs a global reduction operation and returns the result to
the root. It combines the values in the input buffer of each process, applying the specified
operation before each send and placing the result in the output buffer of the root process. In
the case of allreduce, it places the result in the output buffer of all processes. As with
broadcast, the reduce operation can also be represented as a binary tree with ranks
corresponding to nodes, but with the data flowing in the opposite direction. Figure 3.6 shows
the reduce

BCAST(R)
    ∀ ranks r ∈ group R
        if (r = root)
            send_right(r)
            send_left(r)
        if (r = interior)
            wait_from_parent(r)
            send_right(r)
            send_left(r)
        if (r = leaf)
            wait_from_parent(r)

Fig. 3.4. Broadcast algorithm.


Fig.  3.5.  Broadcast  binary  tree  representation. 


algorithm  and  Figure  3.7  shows  its  tree  representation. 

REDUCE(R, op)
    ∀ ranks r ∈ group R
        if (r = root)
            wait_from_right(r)
            wait_from_left(r)
            compute(op)
        if (r = interior)
            wait_from_right(r)
            wait_from_left(r)
            compute(op)
            send_to_parent(r)
        if (r = leaf)
            send_to_parent(r)

Fig. 3.6. Reduce algorithm.


3.4.3. Allreduce. Allreduce returns the result of the global reduction operation to all
processes. Allreduce is typically implemented with a reduce to the root followed by a
broadcast. The IS pattern performs an allreduce to get the bucket size from all processes. Later
these numbers are accumulated to perform key redistribution. Figure 3.8 shows the allreduce

Fig. 3.7. Reduce binary tree representation.


algorithm. 

ALLREDUCE(R, op)
    REDUCE(R, op)
    BCAST(R)

Fig. 3.8. Allreduce algorithm.
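
The composition of a tree reduce and a broadcast can be illustrated in Python. The sketch
simulates all ranks in one process and assumes, for illustration only, a heap-ordered binary
tree in which the parent of rank r is (r - 1) // 2; the function names are hypothetical and the
broadcast is collapsed into copying the root's result.

```python
import operator


def tree_reduce(values, op):
    """Combine values[r] over all ranks r along a binary tree rooted at rank 0.

    Ranks are visited from the highest to the lowest, so every rank has
    already absorbed its children before it sends to its own parent.
    """
    partial = list(values)
    for rank in range(len(partial) - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] = op(partial[parent], partial[rank])
    return partial[0]


def tree_allreduce(values, op):
    """Reduce to the root, then hand the result to every rank (the broadcast)."""
    result = tree_reduce(values, op)
    return [result] * len(values)
```

For example, if each entry of `values` is the local size of one bucket, `tree_allreduce` with
`operator.add` gives every rank the global bucket size, which matches the use of allreduce
in the IS pattern.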


3.4.4.  Alltoall  and  Alltoallv.  Alltoall  sends  a  distinct  message  from  each  process  to 
every  other  process.  It  is  also  known  as  a  complete  exchange.  Alltoallv  also  sends  a  distinct 
message  from  each  process  to  every  other  process,  but  messages  can  have  different  sizes  and 
it  is  possible  for  each  process  to  provide  displacements  for  the  input  and  output  data. 

This algorithm posts all receives followed by all sends, and waits for all communications
to complete. The loop index for both sends and receives is given by (rank + i) mod comm_size;
otherwise, all ranks would communicate with rank i first, then rank i + 1 and so on, creating
a communication bottleneck [19]. A nonblocking receive call indicates that the system may
start writing data into the receive buffer. The receiver should not access any part of the
receive buffer after a nonblocking receive operation is called, until the receive completes;
similarly, for the nonblocking send calls, the buffer containing the data to be sent must not
be changed until the operation is completed.

The  FT  pattern  performs  alltoall  to  compute  the  global  matrix  transposes.  The  IS  pattern 
uses  alltoall  to  find  out  how  many  keys  each  process  will  send  out,  and  alltoallv  to  send  those 
keys  to  the  corresponding  processes.  Figure  3.9  shows  the  Alltoall  algorithm. 
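
The effect of the (rank + i) mod comm_size index can be checked with a short Python sketch.
The helper names are assumptions made for this example. The code compares the staggered order
with a naive order in which every rank walks through the partners 0, 1, 2, ... and reports how
many ranks target the same partner in a single step.

```python
def staggered_partners(rank, comm_size):
    """Partner order used by the algorithm: (rank + i) mod comm_size."""
    return [(rank + i) % comm_size for i in range(comm_size)]


def naive_partners(rank, comm_size):
    """Partner order without staggering: every rank starts at partner 0."""
    return list(range(comm_size))


def max_fan_in(partners, comm_size):
    """Largest number of ranks that address the same partner in one step."""
    worst = 0
    for step in range(comm_size):
        targets = [partners(rank, comm_size)[step] for rank in range(comm_size)]
        worst = max(worst, max(targets.count(t) for t in set(targets)))
    return worst
```

With staggering, the targets of each step form a permutation of the ranks, so no single rank
is addressed by more than one sender at a time; without it, all ranks address the same partner
in every step.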

3.4.5. Barrier. A barrier synchronizes the execution of a group of processes. No process
returns from the barrier until all processes have called it, thus ensuring all processes reach the
same point in the computation and are ready to proceed. Its purpose is to separate computation
phases. The FT pattern calls barriers to synchronize the global matrix transposes. Figure 3.10
shows the Barrier algorithm.

3.5. Simulating Checkpoint/Restart. We aim at characterizing the overhead components
associated with the different resilience methods we are comparing. To this end, we
provide resilience functionality to the application by enabling Checkpoint/Restart (CR) in the
simulation experiments. If resilience is enabled, the state machine for an application pattern
will have additional states to provide resilience functionality. These additional states will
enable various CR mechanisms, handling of faults, and recovery after a fault. In the state
machine, writing checkpoints and message logs causes additional delays based on the amount of

ALLTOALL(R, comm_size)
    ∀ ranks i ∈ group R
        DO nonblocking receive from source (rank + i) mod comm_size
    ∀ ranks i ∈ group R
        DO nonblocking send to destination (rank + i) mod comm_size
    DO wait for all communications to complete

Fig. 3.9. Alltoall algorithm.

BARRIER(R)
    ∀ ranks r ∈ group R
        if (r = root)
            send_left(r)
            send_right(r)
            wait_left(r)
            wait_right(r)
        if (r = interior)
            wait_for_parent(r)
            send_left(r)
            send_right(r)
            wait_left(r)
            wait_right(r)
            send_to_parent(r)
        if (r = leaf)
            wait_parent(r)
            send_to_parent(r)

Fig. 3.10. Barrier algorithm.


data being logged. If checkpoint data travels off-node, an event that informs the destination
of the checkpoint data is triggered. Processing and storing of data are emulated through delay
events with a delay proportional to the amount of data being stored. Similarly, data retrieval
from remote nodes during recovery is emulated with delays for retrieving local data and with
event traffic.
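
A delay model of this kind can be written down in a few lines. The Python sketch below is an
assumption-laden illustration, not the simulator's code: the function name, the parameter
names, and the choice to add a fixed network latency to the off-node notification are all
hypothetical. It only shows how a checkpoint of a given size could be turned into delay events.

```python
def checkpoint_events(nbytes, write_bw, off_node=False, net_bw=None, net_latency=0.0):
    """Return (delay, description) pairs that emulate writing one checkpoint.

    nbytes      -- size of the checkpoint data
    write_bw    -- storage bandwidth in bytes per time unit (assumed fixed)
    off_node    -- True if the data is stored on another node
    net_bw      -- network bandwidth, required when off_node is True
    net_latency -- fixed network latency added to the notification (assumption)
    """
    events = []
    if off_node:
        # The destination is informed that checkpoint data is on its way.
        events.append((net_latency + nbytes / net_bw,
                       "notify destination of checkpoint data"))
    # Storing the data costs time proportional to its size.
    events.append((nbytes / write_bw, "checkpoint stored"))
    return events
```

A state machine would schedule these delays as future events and remain in a checkpointing
state until the last one fires; recovery could reuse the same function with read bandwidths.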

4. Experiments. The experiments consist of applying the resilience methods listed in
Table 2.1 to the four communication patterns we described in Section 3.3. We will compare
these results to those previously obtained using a different simulator coupled with an MPI
implementation for redundant computing [11]. SST takes several input parameters for the
network, node architecture, and storage. For the resilience simulation experiments, coordinated
checkpointing will write checkpoints to both external storage and cabinet solid state disk
(SSD). Uncoordinated checkpointing will store message logs in local flash and checkpoints
on external storage or cabinet SSD. For distributed checkpointing, we store other nodes'
checkpoints in local flash and eventual coordinated checkpoints to stable storage, either disk
or cabinet SSD. For now, storage bandwidth and latency are fixed parameters, and we assume
that storage is always available for recovery.

The  message  pattern  generators  assume  a  very  specific  machine  configuration  for  the 
SST  simulation  framework.  Table  4.1  summarizes  the  simulator  input  parameters. 

SST simulation input parameters

    Total number of cores n       2^n
    Torus X dimension             2^x
    Torus Y dimension             2^y
    Network bandwidth             1.9 GB/s
    Network latency               4.9 μs
    Crossbar switch bandwidth     12.6 GB/s
    Core-router latency           150 ns
    Router hop delay              25 ns
    Router ports                  local, north, south, east, west

Fig. 4.1. SST input parameters.


5. Concluding Remarks. In this paper we detailed the simulation method and the
experiments we propose to compare various resilience mechanisms and evaluate their performance
at extreme scale. Our objective is to characterize the different components that
comprise checkpointing overhead, and determine checkpointing efficiency for several application
classes. Currently, the ghost cell pattern is implemented. It simulates ghost cell exchanges on
a five-point stencil operator where each rank communicates with its east, west, south, and
north neighbors. Implementations of communication patterns for FT, IS, and master/slave are
under way. In addition to measuring application performance, we are interested in memory
and communication overheads of the various resilience methods. For example, it is expected
that the size of message logs in uncoordinated checkpointing will grow at different rates for
different message patterns. We are looking for relevant performance metrics for evaluating
these methods. The maximum log size could indicate the amount of extra node volatile memory
that will not be available to the application or will need to be purchased. We also plan to
extend our model to investigate the effect of the architecture on application resilience, more
specifically how the increased number of cores in a compute node affects the overhead
associated with the different resilience methods and overall application performance. We plan to
validate our results by comparing the number of messages exchanged to the transmissions
and network utilization of an actual application. The result will show whether our application
communication patterns resemble those of real applications. We also plan to compare our results
to those of a message logging library currently under development.

The authors would like to thank the members of the SNL Fault Tolerance group for enriching
discussions on the subject, and anonymous reviewers for their valuable feedback.


STATISTICAL ANALYSIS OF HPC ALERTS AND DEVELOPMENTS IN ROOT CAUSE ANALYSIS

JOEL  M.  VAUGHAN  *,  JON  R.  STEARLEY  +,  SCOTT  A.  MITCHELL  *,  AND  GEORGE  MICHAILIDIS  § 

Abstract. Supercomputers are inherently complex. When they fail, determining which of the many
components making up the system is the cause of the fault is challenging. Solving this problem in full
requires a thorough understanding of how alerts and faults are related. Building on previous work, this
paper further demonstrates the effectiveness of combining system log tools and statistical analysis to
answer relevant questions about supercomputers. In the first part of this paper, we analyze two types of
alerts on a real system. We then discuss several important issues in root cause analysis and potential
research directions.

1. Introduction. Supercomputers, or high performance computing (HPC) systems, are
made up of a complex network of various types of components, ranging from compute nodes
to disks to network cables. In a more general sense, the users of the system and software
version may also be considered components. Throughout this work, the term factor is used to
distinguish the type of component, e.g. compute node, network cable, or user. The successful
operation of such a machine requires all these components to both work individually and
interact in certain ways. Should one or more components fail to function or interact correctly,
a fault occurs, resulting in the failure of one or more jobs. In order to determine the cause
of the problem, it is necessary to examine system alerts. Alerts are messages in system
logs indicating a fault or potential fault. Like the logs in which they reside, alerts are
semi-structured, timestamped text. This text conveys a variable amount of information about the
underlying cause. The alerts themselves may be either direct, in which a component reports
that it has experienced a fault, or indirect, where the component reporting the fault is not the
cause. The root cause problem refers to the difficult problem of determining the component
responsible for a fault by using the limited information available in the alerts.

Previous work by the authors ([16] and [15]) has demonstrated a working solution to
the root cause problem. This solution is based on a multiple simultaneous hypothesis testing
framework with false discovery rate (FDR) control, originally developed by Benjamini and
Hochberg [2]. Although this method has been shown to be effective in many situations, there are
still challenges involved in the root cause problem. These challenges include determining the
significance of component types, implementing the analysis tools on production systems, and
developing ways to scale up the methods to ever larger systems. This paper addresses these
areas. However, further development of these methods requires a more detailed understanding
of how alerts are generated and related to the underlying faults. Thus, this paper examines
several statistical properties of these alerts in detail.
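
For readers unfamiliar with FDR control, the Benjamini-Hochberg step-up procedure can be
sketched in Python as follows. This is the textbook procedure, shown for illustration; how the
p-values are obtained for each component, and the function and parameter names, are outside
what this paper specifies and are assumptions of the example.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q.

    Step-up rule: sort the m p-values, find the largest k such that the
    k-th smallest p-value is at most k * q / m, and reject the k smallest.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for k, i in enumerate(order, start=1):
        if pvalues[i] <= k * q / m:
            cutoff = k  # keep the largest k that satisfies the bound
    return sorted(order[:cutoff])
```

In a root cause setting, each hypothesis might state that one component behaves like its
peers; the rejected indices would then identify the components flagged as anomalous, with the
expected share of false alarms among them held at or below q.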

Splunk [10] is a piece of software designed to efficiently index and search IT information.
It is designed to be inherently customizable for specific situations. Splunk has been
customized for and deployed on Red Sky, one of the HPC systems at Sandia National
Laboratories. This customization and deployment has greatly reduced the amount of effort needed
to examine the system logs and facilitated collaborations between researchers and system
administrators [12], including the work described here.

In this paper, we carefully examine the distribution of two types of alerts (“Out of
Memory” and “SymbolError”), and answer a relevant question related to each type of alert. For
the “Out of Memory” alert, we attempt to determine whether these alerts should be attributed
to the job currently running when an alert occurs, or as a residual effect of a previous job. We
investigate to what degree the “SymbolError” alerts vary with respect to the physical location
of the component recording the alert. We then discuss four ongoing research branches related
to the root cause problem and describe our current progress and future direction in each case.

2. Distribution of Host and Cable Alerts. In this section, we present an analysis of
two types of alerts from Red Sky. The two alerts were chosen because they are both
currently of interest to the system administrators. The first, “Out of Memory,” is reported by the
compute nodes themselves, while the second, “SymbolError,” is reported by switches on the
Infiniband network that connects the compute nodes with each other and with the rest of the
system. For each alert, we examine, to the degree currently possible, the distribution of alerts
across both time and network space. In addition, for each alert type, we attempt to shed light
on a current and relevant question of interest related to that alert. With the “Out of Memory”
alert, we investigate whether the alert is caused by code running at the time the alert is
observed or if the alert is the result of some previous machine state. The “SymbolError” alerts
are examined as a function of their spatial location, to determine whether certain locations
are significantly more likely to experience such alerts.

2.1. “Out of Memory” Alerts. Because the compute nodes of Red Sky are diskless,
users cannot use more memory than is available on a node. “Out of Memory” (OOM) events
occur when the Linux kernel kills a process in order to obtain memory. Intuitively, it seems as
though these alerts are directly caused by users writing code less carefully than is necessary.
However, it has been suggested that the Linux kernels used on the compute nodes can leak
memory. More specifically, Brandt et al. claim that these alerts are the result of memory
leaks [3]. For example, one user could use a certain quantity of memory, but not all this
memory would be available to the next user. This user, expecting 16 GB of memory, might,
because of the leak, only have access to 10 GB. This would result in an OOM and potentially
the failure of a job that should have succeeded. In this case, it would look as though the
second user was the cause of the fault whereas the fault actually lay in a different factor (the
OS).

Using Splunk, we are able to obtain a list of OOM alerts, indicating a time stamp of the
alert and the host that reported the alert. Here, we can view the data as a realization of a
point process, meaning we have available the exact time and location of each alert. Using
Splunk “lookups”, we associate these alerts with a particular job and consequently with a
particular user. Finally, a database called genders (also implemented as a lookup table within
Splunk) gives both the physical and network location of the computers. We have examined
the distribution of the alerts over many factors but only present those that seem the most
interesting, or those that are relevant to the question of interest. This analysis concerns OOM
alerts on Red Sky from the period of June 20, 2010 to June 26, 2010.

Distribution of Alerts. The distribution of OOM alerts was studied over several factors,
as well as time and space. Here we present a small selection of the analyses carried out.

The authors and others have used a Poisson process as a generating model for alerts and
the underlying faults in HPC systems (see [1] or [15]). In order to determine how reasonably
this basic model describes OOM alerts in practice, we examined the inter-arrival times of
the alerts. Specifically, we examine the unique inter-arrival times. That is, we consider the
system-wide inter-arrival times. If OOM alerts occur simultaneously on multiple hosts, this
is considered a single event. This simplification mitigates the non-random effect of
multiple-node jobs, where a single job can experience OOM alerts simultaneously on several nodes. If
the Poisson process is indeed a good model for the arrival process of the alerts, then the
inter-arrival times between these OOM alerts will be exponentially distributed. Figure 2.1 shows
an exponential QQ plot of the inter-arrival times. If the observations match the exponential

Fig. 2.1. Analysis of OOM inter-arrival times. We see that a Poisson model does not appear appropriate for
this process, as the QQ plot does not follow a 45 degree line.


distribution, this plot would show the points along a 45 degree line with a positive slope.
However, the data deviate noticeably from an exponential distribution, indicating that a
homogeneous Poisson process is not a good model for OOM alerts. Slightly more complicated
models, such as an inhomogeneous Poisson process, might be appropriate. Such a model would
take into account that the system's utilization varies with time. Allowing the intensity of the
Poisson process to vary with time would better reflect the behavior of the system.
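
The points of such a QQ plot can be computed with a short Python sketch. The code is an
illustration only: the function names, the plotting positions (i - 0.5) / n, and the choice to
fit the exponential rate from the sample mean are assumptions, not details taken from the
analysis above. Simultaneous alerts are collapsed into one event before the gaps are formed.

```python
import math


def unique_interarrivals(timestamps):
    """System-wide gaps between distinct alert times (simultaneous alerts count once)."""
    times = sorted(set(timestamps))
    return [later - earlier for earlier, later in zip(times, times[1:])]


def exponential_qq(gaps):
    """Return (theoretical, observed) quantile pairs for an exponential QQ plot.

    The theoretical quantiles come from an exponential distribution whose
    mean equals the sample mean of the gaps.
    """
    n = len(gaps)
    mean = sum(gaps) / n
    observed = sorted(gaps)
    theoretical = [-mean * math.log(1 - (i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))
```

If the gaps were exponentially distributed, plotting the pairs would place them close to the
line on which the theoretical and observed quantiles are equal; systematic curvature away from
that line is the kind of deviation described above.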

We  next  examine  when  OOM  alerts  occur  within  a  particular  job.  In  this  paragraph  and 
the  next,  we  consider  all  OOMs  from  our  period  of  study,  and  their  temporal  location  within 
their  job  (the  job  to  which  the  processor  experiencing  the  OOM  is  currently  assigned).  In 
particular,  we  can  examine  the  time  between  the  start  of  a  job  and  the  OOM  alerts  or  the  time 
from  the  OOM  to  the  end  of  the  job.  Both  distributions  are  shown  in  Figure  2.2.  We  see  that 
the  vast  majority  of  times  between  Job  Start  and  OOMs  are  between  one  and  ten  minutes, 
with  a  very  long  tail  to  the  right.  (The  figure  is  truncated.)  That  is,  most  OOMs  occur  within 
the  first  ten  minutes  of  the  beginning  of  the  job,  while  a  few  occur  much  later. 

On  the  other  hand,  the  second  figure  shows  a  bimodal  distribution  of  the  time  from  the 
OOM  alert  to  the  end  of  the  job.  This  shows  that  jobs  tend  to  end  either  soon  after  (or 
slightly  before)  the  OOM  or  continue  for  a  long  time  before  ending.  This  would  suggest 
that  the  state  that  causes  the  OOM  alert  is  either  immediately  fatal  (the  first  case)  or  allows 
the job to continue (the second case). Further analysis using the logs reveals that these two
modes represent very different distributions of the several possible “end states” of the job.
Additional analysis of these state labels is necessary, but initial inspection shows that many
of  the  jobs  that  continue  running  long  after  they  experience  OOMs  end  in  a  “failed”  state. 
If  this  label  accurately  indicates  that  the  job  is  unsuccessful,  this  result  suggests  a  policy  of 
killing  such  jobs  might  be  useful  in  both  saving  machine  resources  and  allowing  users  to 
know  of  problems  in  their  code  hours  sooner. 

Question  of  Interest.  One  relevant  question  is  whether  OOMs  are  more  likely  caused 
by  the  current  user  (the  user  who  experienced  the  OOM  during  his  or  her  job)  or  the  memory 
leak  phenomenon.  Generally,  the  OOM  alerts  are  related  to  code  using  more  memory  than 
is  available.  Since  it  is  not  possible  to  directly  examine  the  code  for  each  job,  we  instead 
examine  the  users.  Some  users,  and  the  code  they  run,  are  more  likely  to  stress  the  memory 
of  the  machine.  Thus,  the  question  of  memory  leaks  vs.  current  user  can  be  asked  as  the 
question  of  whether  the  users  running  the  jobs  are  more  strongly  associated  with  the  alerts, 
or  the  prior  users  of  nodes  are  more  strongly  associated.  We  examined  the  OOM  alerts  as  a 
function  of  current  user,  and  as  a  function  of  previous  user.  That  is,  rather  than  attributing 
(a) Time To OOM (b) Time From OOM

Fig. 2.2. Distribution of time from Job Start until OOM alert (left) and distribution of time from OOM alert to
Job End (right). Note that the negative values in the right-hand graph indicate the job was ended by the scheduler
before the OOM alert on that node.


the alert to the individual running a job, we attributed it to the individual who had last
run a job on the host reporting the OOM. If the alerts are caused by the current user, we
would expect a stronger association between alerts and current user. If they were in fact
caused by the memory leak problem, we would expect a stronger association between the
previous users and the OOM alerts, because the previous user would be the one responsible
for stressing the memory and putting the node in the vulnerable state.
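
The two attributions can be illustrated with a short sketch. The record layouts used here (jobs as (user, host, start, end) tuples and alerts as (host, time) tuples) are hypothetical and are not the schema of the job table; the sketch also assumes that jobs on a given host do not overlap in time.

```python
from collections import defaultdict

def attribute_alerts(jobs, alerts):
    """Return one (current_user, previous_user) pair per alert.

    jobs:   iterable of (user, host, start, end) tuples (assumed layout)
    alerts: iterable of (host, time) tuples (assumed layout)
    """
    runs_by_host = defaultdict(list)
    for user, host, start, end in jobs:
        runs_by_host[host].append((start, end, user))
    for runs in runs_by_host.values():
        runs.sort()  # chronological order on each host

    attributions = []
    for host, time in alerts:
        current = previous = None
        runs = runs_by_host.get(host, [])
        for i, (start, end, user) in enumerate(runs):
            if start <= time <= end:
                current = user
                if i > 0:
                    previous = runs[i - 1][2]  # user of the preceding job
                break
        attributions.append((current, previous))
    return attributions

jobs = [("u1", "h1", 0, 10), ("u2", "h1", 11, 20)]
print(attribute_alerts(jobs, [("h1", 15), ("h1", 5)]))
```

Note that when one user runs successive jobs on the same host, the current and previous user coincide, so the two attributions are not independent.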

Figure  2.3  shows  the  association  between  OOM  alerts  and  users.  In  these  figures,  the 
association  of  OOM  alert  with  host  is  maintained,  while  the  alerts  are  attributed  to  users  in 
different  ways.  As  a  baseline  (on  the  left)  we  show  the  alerts  randomly  assigned  to  users, 
weighted  by  overall  usage.  If  there  were  truly  no  association  between  the  OOM  alerts  and 
users,  we  would  expect  to  see  a  graph  such  as  this.  Next,  in  the  middle,  we  show  the  results 
of  attributing  the  OOM  alerts  to  the  users  running  the  jobs  that  experienced  them.  In  the  third 
figure,  we  show  the  results  of  attributing  the  alerts  to  the  previous  user.  We  constructed  other 
variations  of  the  previous  user  graph,  but  all  were  somewhat  similar  to  the  one  displayed. 
We see that neither the current nor previous user allocation resembles the random assignment
graph, and thus there is a signal present in both of these assignments, indicating that both
factors might indeed be relevant. The signal appears stronger in the current user setting,
indicating that it is a more significant factor than the previous user; however, the signal
associated with previous user should not be ignored.
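
As an illustration of the random baseline, the following sketch draws alerts at random with probability proportional to each user's usage. The dictionary of node-hours, the function name, and the fixed seed are assumptions for the example; the authors' exact procedure for generating the left panel of Figure 2.3 is not specified in the text.

```python
import random
from collections import Counter

def random_baseline(node_hours, n_alerts, seed=0):
    """Assign n_alerts to users at random, weighted by usage.

    node_hours: dict mapping user -> total node-hours (assumed input format)
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    users = list(node_hours)
    weights = [node_hours[u] for u in users]
    draws = rng.choices(users, weights=weights, k=n_alerts)
    return Counter(draws)

# Hypothetical usage figures; heavier users should receive more alerts.
counts = random_baseline({"user1": 900.0, "user2": 90.0, "user3": 10.0}, 1000)
```

Comparing observed per-user alert counts against draws like these shows what "no association between alerts and users" would look like for the same usage profile.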

To investigate further, we carried out our FDR analysis [15] on both the current user and
previous user factors. The results are shown in Table 2.1. In each case, the distribution of
p-values was also examined via histograms (not shown). The p-values are more uniformly
distributed in the previous user case, indicating that this case has a weaker signal. We use
the job table to conduct a closer examination of the “significant” users in each case. The
most significant previous user, user 117, ran only one job over the time period studied and
experienced no OOMs directly. The nodes in that job were next taken by four different jobs,
only one of which experienced OOM alerts. This job was run by user 12, who was identified as
significant in both the current and previous user cases. It is more reasonable to assume that
user 12 is responsible for these alerts, given the other evidence against this user, than to
assume they were caused by a memory condition established by user 117. Furthermore, user 12
often ran successive jobs on the same node, meaning many OOMs attributed to this person as the
previous user are also attributed as the current user, possibly explaining the user’s
significance. Thus,
Fig. 2.3. Distribution of OOM alerts across users and hosts, via different methods of attributing alerts to users.
In all figures, users are arranged from left to right in order of decreasing machine usage. The left figure shows
the distribution if alerts are randomly assigned to users, based on each user’s total node-hours. The middle figure
shows the distribution if alerts are attributed to the current user. The right figure shows the distribution if the alerts
are attributed to the previous user. In comparison with the randomly assigned alerts, both the current and previous user
distributions show a signal, indicating that both are relevant factors, with the stronger signal in the current user
graph indicating that this factor is more important with regard to OOM alerts.

Table 2.1
Results of FDR Root Cause Analysis of OOM alerts, comparing results of associating alerts with current user
(left) and previous user (right). In each case, the ten users with the smallest p-values are displayed, sorted
by p-value. The rejection criterion is a varying threshold based on the false discovery rate controlling procedure of
Benjamini and Hochberg [2], which is discussed in relation to the Fault-Cause Problem in [16]. Note that in both
cases, two users are chosen as significant. (Although the data are real, the user names have been anonymized. The
numbers correspond to the columns in Figure 2.3.)

[Table body not recoverable. Surviving entries, in source order: user73; 0.018; user50; 0.018; False; user14; 0.17;
0.02; False; user88; 0.083; 0.02; False.]


the  two  significant  previous  users  are  easily  explained  in  terms  of  current  users.  In  contrast, 
the  two  significant  current  users  are  not  easily  explained  in  terms  of  previous  users.  Each  ran 
many  jobs  that  experienced  OOMs,  and  these  jobs  had  a  large  number  of  different  previous 
users.  After  this  is  taken  into  account,  there  is  much  stronger  evidence  associating  the  OOM 
alerts  with  the  current  user,  as  opposed  to  the  previous  user  as  the  result  of  a  memory  leak. 
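
For reference, the Benjamini-Hochberg step-up rule that produces the varying rejection threshold can be sketched as follows. This is a generic implementation of the published procedure, not the authors' code; the FDR level q = 0.05 and the example p-values are assumptions, since the level used for Table 2.1 is not stated here.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    The i-th smallest p-value is compared with the threshold i * q / m;
    every hypothesis up to the largest rank that passes is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected

# Hypothetical per-user p-values (one test per user).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

Because the threshold grows with rank, a user's p-value is judged relative to how many smaller p-values exist, which is what distinguishes this rule from a fixed cutoff.
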
Fig. 2.4. Time series of SymbolError alerts, by host (left) and port (right). The number of alerts is shown on a
log scale to improve visibility. The yellow block in the upper right area is a host that experienced a high number of
alerts during a few consecutive time periods. The system administrators suggested that this is the result of a change
made to the system.


2.2. “SymbolError” Alerts. The Infiniband network that connects the various components
is crucial for the successful execution of all jobs, whether to communicate across
multiple nodes, access the network disk arrays, or even launch. SymbolErrors occur when
a switch recognizes a portion of a received packet as being corrupted during transmission.
Unlike the OOM alerts, the SymbolError alerts are not reported with individual time stamps.
Every ten minutes, each switch reports the number of SymbolErrors it experienced on each
of its ports. Thus, these data form a set of time series, one for each switch/port pair. In
order to associate these with jobs as we did with OOMs, we must join the alerts, jobs, and
network routing tables. The routing tables consist of the ordered list of network ports a
transmission passes through for host A to communicate with host B (the path from B to A is
different). There are five to ten million route changes per day, and we are still working on
adequately joining all these data. However, via the aforementioned genders table and a table
of physical links (which do not change with time), we are able to associate the alerts with
their physical locations. Below, we examine the SymbolErrors that occurred on Red Sky between
July 1, 2010 and July 7, 2010.

Distribution of Alerts. We next present a subset of the analysis of SymbolError alerts.
Figure 2.4 shows the time series of the alerts by host and port. We notice different patterns
across hosts, with a high-alert event near the end of the period studied. Some hosts and
ports rarely see SymbolErrors, while others see them somewhat regularly. A third set
saw them infrequently except during a small time window, during which they saw many.
As with the OOM alerts, it is of interest to determine whether a Poisson model is appropriate
to describe the behavior of SymbolErrors. Since event inter-arrival times are not available,
we instead evaluate whether the counts of SymbolErrors per time period follow a Poisson
distribution. To do so, we construct QQ plots comparing the observed distribution of counts
to a Poisson distribution. The results, in Figure 2.5, indicate that the Poisson model is not
appropriate for the aggregate SymbolErrors. We followed up by examining each host
separately. Of the 52 hosts experiencing alerts during the time studied, 9 exhibit Poisson
behavior. One such host (b4-neml-b) is shown on the right-hand side of Figure 2.5. Of the
remaining hosts, 27 showed rare positive alert counts of small magnitude, while 16 showed
equally rare alert occurrences, but with much higher counts. While the 27 hosts
with rare small positive counts could be modeled as Poisson with a very low rate, the model is
Fig. 2.5. Poisson QQ plots of SymbolError time series. The left plot, showing the aggregate SymbolErrors
from all hosts, shows a deviation from the Poisson distribution. The right plot, which considers SymbolErrors on a
single host, shows that the Poisson model is reasonable for this host. Eight other hosts show similar behavior, while
27 hosts could be modeled by a Poisson process with a low rate. However, 16 hosts have rare but high counts, a
phenomenon that is not captured well by the Poisson model and that spoils the aggregate fit.


not appropriate for the hosts that have rare but large counts. These results suggest a modeling
approach where each host independently exists in one of three alert-generating states:
off, where the host experiences no alerts; on, where the host experiences alerts according to a
Poisson mechanism; and high, where the host experiences a high number of alerts.
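
A quick numerical complement to the QQ plots is the index of dispersion (sample variance divided by sample mean), which is close to 1 for Poisson counts. This is a different diagnostic from the one used in the text and is offered only as a sketch; the function name and the example count series are hypothetical.

```python
def dispersion_index(counts):
    """Sample variance divided by sample mean; near 1 for Poisson counts.

    Returns None when every count is zero (a host that is always "off").
    """
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return None
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean

# Hypothetical ten-minute count series for two hosts.
steady_host = [1, 2, 3]          # modest, regular counts
bursty_host = [0] * 9 + [100]    # rare but very large count
print(dispersion_index(steady_host), dispersion_index(bursty_host))
```

A value far above 1, as for the bursty series, corresponds to the rare-but-large pattern that the Poisson model fails to capture.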

Question  of  Interest.  The  nodes  making  up  Red  Sky  are  placed  in  racks,  arranged  in 
four  rows  and  five  columns.  Each  rack  consists  of  four  shelves,  stacked  one  on  top  of  the 
other.  Each  shelf  consists  of  12  slots  that  contain  the  individual  compute  nodes.  It  has  been 
hypothesized  that  there  is  a  manufacturing  defect  in  the  hardware,  making  SymbolErrors 
more  likely  in  the  backplane  connections  to  certain  slots.  We  investigated  the  distribution 
of SymbolErrors across slots, shown in Figure 2.6. In the left-hand figure, it is clear that
SymbolErrors occur more frequently in certain slots (e.g., slot 8) than others. The right-hand
graph shows the numbers of populated slots. The near-uniform distribution of this figure
indicates that the higher number of alerts in certain slots is not simply because these slots
are occupied more than other slots. To be more rigorous, we perform a Chi-squared test
against the uniform distribution where each slot is equally likely to receive alerts, and get a
test statistic of $\chi^2 = 5807.4$ and a p-value $\approx 0$. Normalizing by usage results in a
test statistic of $\chi^2 = 7386.3$ and a p-value $\approx 0$. Both tests indicate the distribution
of SymbolErrors across slots is significantly different from what would be expected if the alerts
were randomly assigned slots, indicating that there is some systematic difference across the
machine. One explanation could be a manufacturing defect, although other possibilities, such
as the way cables rest, could also be responsible.
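
The test statistic can be computed as in the sketch below. It uses Pearson's statistic with expected counts that are either equal across slots or proportional to a usage weight; the interpretation of "normalizing by usage" as usage-proportional expected counts, and the example numbers, are assumptions rather than details taken from the analysis.

```python
def chi_squared_statistic(observed, weights=None):
    """Pearson chi-squared statistic against a (weighted) uniform model.

    observed: alert counts per slot
    weights:  optional usage per slot; expected counts are proportional to it.
              With weights=None every slot is treated as equally likely.
    """
    k = len(observed)
    total = sum(observed)
    if weights is None:
        weights = [1.0] * k
    weight_sum = sum(weights)
    expected = [total * w / weight_sum for w in weights]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for three slots (not the Red Sky data).
print(chi_squared_statistic([20, 10, 0]))             # uniform expectation
print(chi_squared_statistic([20, 10, 10], [2, 1, 1])) # usage-weighted expectation
```

The statistic is compared with a chi-squared distribution with (number of slots - 1) degrees of freedom, so with 12 slots a value in the thousands lies far in the tail.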

We  also  examined  the  distribution  across  the  other  spatial  dimensions:  shelf,  row,  and 
column.  These  distributions  can  be  seen  in  Figure  2.7.  It  is  surprising  to  note  that  in  each 
case, there are one or two levels that experience more alerts than others. The patterns in Figures
2.6  and  2.7  are  not  the  result  of  a  single  faulty  host,  suggesting  a  systematic  cause  across  these 
dimensions.  This  phenomenon  had  not  been  previously  noticed  by  the  system  administrators, 
and  there  is  currently  no  explanation  for  the  systematic  differences  noticed,  although  it  is 
being  investigated. 

3.  Ongoing  Developments  in  Root  Cause  Analysis.  In  the  previous  sections,  we  have 
shown  the  usefulness  of  statistical  analysis  of  log  data  in  answering  relevant  questions  about 
Fig. 2.6. Distribution of SymbolErrors across Slot, showing noticeably more SymbolErrors in certain slots. The
top left plot shows the total number of SymbolErrors by slot, while the lower left plot shows the total number of time
windows with a positive SymbolError count for each slot. The slot usage (right) is much more uniform than either
graph.


(a)  Shelf  (b)  Row  (c)  Column 


Fig. 2.7. Distribution of SymbolErrors across shelf, row, and column. The top figure shows the number of
SymbolErrors, while the lower figure shows the number of time windows with a positive SymbolError count for each
dimension. Note that in all three of these dimensions, some entries see noticeably more SymbolErrors than others.


Red Sky. In the remaining sections, we discuss our ongoing work in Root Cause Analysis,
showing how statistical analysis of logs is being developed to be both more effective at
determining the root cause and more accessible to network administrators.

3.1. Custom Splunk Search and Analysis. Here we report on the first steps toward turning our
research methodology into a deployed capability. Root cause analysis like the work presented
above and the FDR-based analysis we present in [16] and [15] is useful, but it requires multiple
steps: aggregating and processing multiple system logs, searching these logs for relevant
information, performing relevant statistical analyses, and reporting on the results. If these
steps need to be performed by multiple individuals, even the useful methodologies we have
developed have little chance of being employed in practice. Splunk’s capabilities to efficiently
combine and present data from across multiple logs have already been employed for the first
steps in this analysis, with considerable savings in the work necessary to examine the logs,
as described in [12].

Splunk provides several ways to create custom search commands within the program
[11]. By taking advantage of Splunk’s Python interface, we have prepared some of the analyses
from [15] as custom searches, shown in Figure 3.1. Thus, Splunk can be used by system
administrators or others to search, aggregate, and analyze the data to perform FDR-based
root cause analysis of certain types of alerts using a single command from the Splunk web
interface.
Fig. 3.1. Screenshot of results of FDR custom search command. With a single command, the Splunk user
dispatches the relevant searches and performs our FDR root cause analysis, generating the output shown.


3.2. Cables as Components. It has been determined by extensive testing that the number
of SymbolErrors is dependent on the traffic patterns that the cables are carrying. Based on
our previous work [15], a job-centric analysis of the SymbolError alerts should therefore
help determine which jobs “cause” SymbolErrors (or at least, how to reliably reproduce
their occurrence, which is often the first step to finding a solution). There is, however,
a key difference. The alerts studied in our previous work appear on hosts. A job-centric
analysis of these alerts assigns each component to a single job at a given time. However,
cables may be shared by multiple jobs at any given moment. Our methodology does handle
the situation of shared components, but as previously described we are currently unable to
perform the necessary join of alert, job, and routing information.

3.3. Component Types by Significance. We now briefly discuss an ongoing and important
research question. Consider again the OOM alerts described in Section 2.1. With our
existing methodology, it is possible to independently search for statistically significant hosts
or users, as well as other factors. However, these factors are closely related. In fact, the very
same alerts must be used for both user and host, since the alerts occur on a particular host
while it is being employed by a particular user. Thus, if an analysis turns up significant
components of multiple types, which type of component is most likely to be the root cause? This
problem is of primary importance, as a successful answer will result in effort being quickly
focused on solving the problem at the root cause, rather than trying to determine the cause
or trying to fix a component other than the cause of the problem. Often, further analysis can
show, at least intuitively, that one factor is more relevant than another. Consider again Figure
2.3(b). This figure shows that although both hosts and users might be identified as significant,
the spread of these alerts across users is far more organized than the distribution across
hosts, indicating that the user factor is the most relevant to this type of alert. Although this
follow-up analysis is useful, it relies partially on intuition. As we continue working on this
problem, we hope to both formalize this technique and automate the procedure, making it
available to system administrators as a useful and relevant tool. One possible solution is the
group hypothesis testing approach discussed in the following section.

3.4. Group Hypothesis Testing. Our original approach to the root cause problem is
based on hypothesis testing with FDR control. This method has also been used extensively in
communities analyzing gene microarray experiments. In both contexts, the large number
of simultaneous tests gains statistical power from the FDR-based approach as compared to
more traditional multiple testing paradigms. However, as technology advances, the number of
tests continues to grow, pressing the limits of even the FDR-based method. This is true
both in the microarray setting, where improved technology allows for an ever increasing number
of experiments to analyze, and the HPC setting, where computers are being constructed
with more and more components. As the number of components increases, it becomes nearly
impossible to obtain a statistically significant result, even in the presence of strong evidence.

In order to handle the increasing number of hypotheses, those working with microarrays
have suggested grouping meaningful sets of individual genes together, and conducting a series
of hypothesis tests across the groups rather than the individual genes. If relevant to the
research question at hand, a follow-up analysis can examine only the significant groups and
test the individual genes within the group. (See [13] and [4].) This procedure makes large
gains in power, as there are fewer hypotheses both when testing for significant groups and
when testing within the group. The complexity of HPC systems and the nature of our
available data make it difficult to apply these methods directly; however, we have made
appropriate modifications to the methods advanced by Efron and Tibshirani [4] and carried out
some preliminary simulation studies. These studies showed the potential of this method to
greatly improve the overall power of the procedure in situations where our previous method
would suffer.
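
The two-stage idea (test groups first, then test components only inside significant groups) can be sketched as below. This is an illustrative stand-in and not the modified Efron-Tibshirani procedure used in the simulations: group p-values are formed with Fisher's combination method, which assumes independent, strictly positive p-values, and Benjamini-Hochberg is applied at both stages with an assumed level q = 0.05. All names and example values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-squared with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)
    # Chi-squared survival function for even degrees of freedom (closed form).
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

def bh_reject(p_values, q):
    """Benjamini-Hochberg step-up rule; returns a list of booleans."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    keep = set(order[:cutoff])
    return [i in keep for i in range(m)]

def two_stage_group_test(groups, q=0.05):
    """groups: dict group_name -> dict component -> p-value (assumed layout).

    Returns dict of significant group -> list of significant components.
    """
    names = list(groups)
    group_p = [fisher_combined_p(list(groups[g].values())) for g in names]
    result = {}
    for name, hit in zip(names, bh_reject(group_p, q)):
        if hit:
            comps = list(groups[name])
            inner = bh_reject([groups[name][c] for c in comps], q)
            result[name] = [c for c, r in zip(comps, inner) if r]
    return result

groups = {"rackA": {"h1": 0.001, "h2": 0.002, "h3": 0.9},
          "rackB": {"h4": 0.5, "h5": 0.7, "h6": 0.6}}
print(two_stage_group_test(groups))
```

The gain in power comes from the smaller number of hypotheses at each stage: here two group tests followed by three component tests, instead of six component tests at once.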

As  noted  by  Efron  and  Tibshirani,  the  groups  need  to  be  constructed  in  a  meaningful  way. 
Our  simulations  have  suggested  that  poorly  chosen  groups  do  not  provide  sufficient  statistical 
power,  while  well-chosen  groups  perform  very  well.  Table  3.1  shows  the  power  obtained  in 
our  simulations.  Note  that  the  power  depends  strongly  on  the  grouping  structure  used  and 
in  some  cases  on  the  number  of  groups  considered.  Thus,  it  is  important  to  use  a  carefully 
chosen  group  structure. 

The importance of the chosen group structure in this procedure has the potential to be
useful in solving the root cause problem, particularly the issue of determining the most
significant factors, as discussed in Section 3.3. Rather than trying to determine a single
grouping structure, we propose developing several grouping structures, each based around a
specific factor type. We then will run the testing procedure on each of the grouping structures.
If a grouping structure is not meaningful, no group should be chosen as significant, and this
will suggest that the factor associated with the grouping structure is not significant as a root
cause. However, if a grouping structure is meaningful, one or more groups will be identified as
significant. This will identify the associated factor as significant, and an analysis of the
significant groups will identify which individual components are the root cause. For example,
consider two grouping strategies, one that puts together users running similar code, while the
other puts together hosts that are in similar physical locations. If the root cause of a problem
is user-related, one or more of the user-based groups will be significant, while the host-based
groups would not be.

Although  the  above  plan  is  theoretically  promising,  one  difficulty  that  must  be  overcome 
is  creating  group  structures  that  associate  with  the  factors  being  considered.  We  have  begun 
exploring  the  creation  of  these  groups  by  partitioning  a  series  of  association  graphs.  For  a 
given  factor,  we  hope  to  construct  a  graph  describing  the  associations  between  components, 
and  then  form  groups  by  partitioning  that  graph.  For  partitioning,  we  plan  to  use  software 
called  Metis  [7]. 

4. Conclusion. By using the capabilities of Splunk, we have been able to gather and
aggregate data on system alerts from a diverse set of relevant logs. This information has then
been used to perform statistical analyses that answer relevant questions about Red Sky and
raise some new ones. In addition, this work has demonstrated the necessity of continuing
work on the root cause problem, formally addressing both the problem of determining the
most significant factor and that of increasing numbers of components. Finally, in order for
these methods to be useful, they must be readily available to those who maintain such systems.
Providing our methods via custom Splunk searches makes it more likely that progress
made on the root cause problem will translate into less time spent finding problems and more
time using HPC systems for their intended purpose.

                         Direct Alerts                              Indirect Alerts
Number of Groups   Structure 1  Structure 2  Structure 3    Structure 1  Structure 2  Structure 3
 5                    1.00         0.06         0.00           1.00         0.05         0.08
10                    1.00         0.00         0.66           1.00         0.05         0.40
15                    1.00         0.03         0.77           1.00         0.08         0.63
25                    1.00         0.75         0.81           1.00         0.29         0.79

Table 3.1
Statistical power of group hypothesis testing method. Three different grouping structures were tested for each
of 200 simulated iterations. The structures were chosen so that in the simulation, different numbers of faulty hosts
were put into groups in the three different structures. Notice that the power of the group testing depends greatly on
both the grouping structure chosen for the analysis and the number of groups chosen. Structure 1, which collects all
faulty hosts in the same group, performs well (power ≈ 1) regardless of the number of groups. Structures 2 and 3,
which spread the faulty hosts over multiple groups, perform better with a larger number of groups.
TECHNIQUES FOR MANAGING DATA DISTRIBUTION IN NUMA SYSTEMS

ALEXANDER  M.  MERRITT  *  AND  KEVIN  T.  PEDRETTI  1 


Abstract. MPI is the dominant programming model for distributed memory parallel computers, and is often
used as the intra-node programming model on multi-core compute nodes. However, application developers are
increasingly turning to hybrid models that use threading within a node and MPI between nodes. In contrast to MPI,
most current threaded models do not require application developers to deal explicitly with data locality. With the
increasing core counts and deeper NUMA hierarchies seen in the upcoming LANL/SNL “Cielo” capability supercomputer,
data distribution imposes an upper bound on intra-node scalability within threaded applications. Data locality
therefore has to be identified at runtime using static memory allocation policies such as first-touch or next-touch, or
specified by the application user at launch time. We evaluate several existing techniques for managing data
distribution using micro-benchmarks on an AMD “Magny-Cours” system with 24 cores among 4 NUMA domains and
argue for the adoption of a dynamic runtime system implemented at the kernel level, employing a novel page table
replication scheme to gather per-NUMA domain memory access traces.


1. Introduction. Scalability plays an important role in high-performance computing.
Large-scale simulations of nuclear reactions, nuclear decay, climate modeling, fluid dynamics
and combustion all require greater precision and reduced time-to-solution to achieve
accurate and timely predictions. In order to achieve this goal, hardware and software must scale
well together. With the onset of exascale computing, current parallel programming models
approach their limits in scalability as supercomputing hardware transitions from
distributed-memory single-core processor architectures to hybrids of distributed-memory and
shared-memory, heterogeneous and accelerator-supported systems.

Our  work  is  focused  on  the  architecture  of  the  upcoming  LANL/SNL  “Cielo”  capabil¬ 
ity  supercomputer,  which  consists  of  AMD  Opteron  “Magny-Cours”  non-uniform  memory 
access  (NUMA)  processors.  In  contrast  to  previous  SNL  systems  such  as  ASCI  Red  and 
Red Storm, Cielo's hardware exhibits greater complexity within the node and processor itself,
adding more cores and greater variation in memory access latencies. Non-local memory
accesses comprise multiple levels of non-uniformity, incurring different penalties depending
on the path traveled. Figure 1.1 illustrates our experimental system, a dual-socket Magny-Cours
machine with 24 cores in 4 NUMA domains. Off-chip diagonal HyperTransport links exhibit the largest
latencies and are 8 bits wide, half the width of all other processor interconnects in the
system.  Each  Cielo  compute  node  is  similar  to  our  test  system  except  that  Cielo  uses  8- 
core  processors  instead  of  12-core,  and  that  the  Cielo  cores  operate  at  2.4  GHz  instead  of 
2.2  GHz  on  our  test  system.  As  parallel  applications  become  increasingly  hybrid — utilizing 
threaded  programming  models  combined  with  message  passing,  for  example — performance 
degradation  from  inadequate  intra-node  data  distribution  on  this  architecture  becomes  more 
pronounced. 

In  this  paper  we  discuss  the  hybridization  of  parallel  programming  models  and  how  they 
perform  in  light  of  this  new  architecture,  focusing  on  the  evaluation  of  current  intra-node 
approaches  to  data  distribution  that  attempt  to  minimize  the  influence  of  NUMA  latencies  in 
multithreaded  applications.  We  demonstrate  that  these  approaches  are  not  adequate  to  address 
losses  in  performance  and  additionally  require  developers  to  modify  their  application  code. 
We argue for the use of a dynamic runtime system within the operating system to identify
data access patterns of an application and use this information to redistribute data, avoiding
the code changes and user intervention that current approaches require.


* Georgia Institute of Technology, merritt.alex@gatech.edu
† Sandia National Laboratories, ktpedre@sandia.gov

Fig. 1.1. Two AMD Opteron 6174 processors.


1.1.  Parallel  Programming  Models.  Programming  languages  have  evolved  along  with 
the  hardware  on  which  they  run.  MPI  [1]  is  a  library-based  message-passing  extension  to  ex¬ 
isting  languages,  such  as  C  and  Fortran.  This  model  is  a  good  fit  for  the  distributed-memory 
architectures  of  supercomputers.  Parallel  units  of  execution  called  ranks  exist  in  their  own 
address space, each occupying one CPU core per compute node. In this model, communication
and data sharing are explicit and must be known to the application developer,¹ enabling
him or her to finely tune algorithms to minimize communication overheads. This advantage
of MPI is, however, also its drawback: scaling applications to the extreme of exascale
computing  with  hundreds  of  millions  of  processors  [12]  detracts  from  programmer  produc¬ 
tivity  in  addition  to  making  debugging  difficult.  More  limitations  on  scalability  arise  with  the 
introduction  of  modern  supercomputing  hardware:  higher  core  counts — more  ranks — within 
a  node  increase  memory  consumed  by  global  state  replication  and  increase  communication, 
causing  contention  on  processor  interconnects.  Furthermore,  a  recent  research  study  [14] 
compared  the  performances  of  message-passing  and  threaded  programming  implementations 
of sparse matrix-vector multiplication code, demonstrating that threaded models perform better
on SMP hardware. These are all motivations for relaxing the trend of running “MPI
everywhere”. 

Thread-based  parallel  programming  models  such  as  OpenMP  [2]  and  Pthreads  operate 
in  a  single  address  space,  avoiding  the  overheads  of  MPI.  One  address  space  allows  global 
access  to  shared  state  and  communication  to  be  achieved  through  shared  variables.  Compared 
to  other  models,  OpenMP  is  an  automated  parallelization  tool  designed  to  move  the  burden 
of  explicit  parallelization  from  the  programmer  to  the  compiler  using  simple  #pragma  di¬ 
rectives  in  code.  While  combining  threaded  and  message-passing  programming  models  for 
intra-node  and  inter-node  parallelization  may  show  improvements  [11],  support  for  threading 
models on SMP NUMA systems is still immature. OpenMP was designed assuming homogeneous
processing and uniform memory access (UMA) architectures, but this assumption no longer holds
as commodity processors become increasingly NUMA. Without data replication,
data distribution is proving to be an inhibitor of software scalability on the
parallel hardware present in the Cielo supercomputer.


¹ Programmer and developer are used interchangeably in this paper.

1.2. Related Work. Current approaches take advantage of the virtual memory subsystem
that the operating system kernel provides [5], use hardware counters provided by the
CPU [13], or combine techniques at the user level [3, 4].

User-level  support  for  data  distribution  [3]  combines  knowledge  from  both  a  NUMA- 
aware  memory  manager  and  custom  thread  scheduler,  provided  as  an  extension  to  OpenMP 
to  best  minimize  remote  memory  references.  As  with  our  work,  this  research  advocates  a 
dynamic  runtime  system  that  can  adapt  to  changes  in  thread-data  affinities  throughout  an 
application’s execution. In contrast to our proposed technique for managing data distribution,
this work still requires applications to use a supplementary API, whereas ours aims for an
application-independent implementation.

2. Evaluation. In this section we discuss recent techniques from the literature and their effects
on relevant micro-benchmarks at Sandia as our motivation for runtime analysis.

2.1. Background. We begin by giving some background on virtual memory support
and use it as a basis for discussing the various techniques in the literature.


Fig. 2.1. Virtual memory mapping for two tasks (Task A and Task B).


Hardware  and  operating  system  support  for  virtual  memory  allow  for  flexibility  when 
designing  models  for  data  distribution.  Each  process  is  allowed  to  believe  it  has  complete 
control  over  the  entire  system — full  access  to  all  of  memory  and  the  processor.  In  figure  2.1 
we see the state of two processes at a given point in time. Memory (both virtual and physical)
is divided into equal-sized regions called pages. When a process performs a memory operation
such as a load or a store on data not present in memory, the hardware traps the faulting
process and automatically transfers control to the operating system. The operating system
then loads an entire page of data from an external source into an empty page in memory
and establishes a translation. Each translation’s state is recorded in a structure in memory
maintained by the kernel, called a page table, which contains, among other information, protection
and access bits. Applications run as a combination of processes and threads, unaware of
this mechanism that controls the distribution of their data in real hardware.

Within  a  NUMA  system,  all  processors  and  memory  regions  are  divided  into  domains. 
A domain is defined as a set of processors or cores and a region of memory between which
access latency is lowest. Figure 2.2 depicts normalized latency costs of accessing all regions
of memory from domain 0 in both a UMA and a four-domain NUMA environment. Should a
process  be  scheduled  to  execute  in  domain  0,  any  data  accesses  to  domain  3  would  incur  the 
highest  latency. 

In the following subsections we review the micro-benchmarks and their memory access patterns,
examine current techniques, and conclude with a discussion of the impact of
these techniques as motivation for a dynamic runtime memory migration system.

2.2. Benchmarks. In this subsection we discuss two micro-benchmarks used to evaluate
current approaches to data distribution, and describe them in terms of phases and how current
approaches affect these phases. Both micro-benchmarks model common characteristics seen
in scientific code at Sandia and are used as the foundation for further study. Benchmarks
originally designed around MPI and rewritten to use OpenMP have had all explicit data movement removed.

Fig. 2.2. UMA vs NUMA: memory latencies relative to domain 0.

We  define  a  phase  in  a  multithreaded  application  to  mean  an  interval  of  time  within 
which  data  access  patterns  remain  deterministic  or  constant  for  each  thread,  within  some 
threshold.  A  change  in  the  number  of  threads  also  constitutes  a  phase  change  as  data  access 
characteristics  must  either  be  established  for  newly-spawned  threads,  or  forgotten  with  fewer 
threads. 
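
The definition of a phase given above can be sketched as a comparison between consecutive sampling intervals of a memory trace. The trace layout (one dictionary per interval, mapping a thread ID to the set of pages it touched), the Jaccard similarity measure and the threshold value are all assumptions made for illustration; they are not the format produced by our Pin tool.

```python
def jaccard(a, b):
    """Similarity of two page sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def phase_boundaries(intervals, threshold=0.5):
    """Return the indices i at which interval i starts a new phase.

    intervals: list of dicts mapping thread id -> set of pages touched.
    A boundary is reported when the set of threads changes, or when any
    thread's page set is less similar to its previous one than the
    (hypothetical) threshold allows.
    """
    boundaries = []
    for i in range(1, len(intervals)):
        prev, cur = intervals[i - 1], intervals[i]
        if prev.keys() != cur.keys():
            boundaries.append(i)      # number of threads changed
        elif any(jaccard(prev[t], cur[t]) < threshold for t in cur):
            boundaries.append(i)      # access pattern shifted
    return boundaries


# Toy trace: one thread initializes all pages, then four threads each
# work on their own quarter of the pages.
init = {0: set(range(16))}
work = {t: set(range(4 * t, 4 * t + 4)) for t in range(4)}
trace = [init, init, work, work, work]
print(phase_boundaries(trace))  # [2]
```

A real detector would have to choose the interval length and threshold empirically; the sketch only shows that both conditions in the definition (pattern change and thread-count change) reduce to simple set comparisons.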

The  use  of  a  dynamic  binary  instrumentation  tool  called  Pin  [9]  allows  us  to  capture 
all  instructions  of  an  application  that  access  memory.  With  this  information  we  are  able  to 
visualize  the  data  access  patterns  for  both  micro-benchmarks. 

2.2.1.  Evaluation  System.  Our  evaluation  system  is  a  single  shared-memory  system 
with  two  AMD  Opteron  6174  “Magny-Cours”  processors.  Each  is  a  multichip  module  (MCM) 
containing  two  processor  dies  each  with  six  symmetric  cores  running  at  2.2  GHz,  two  mem¬ 
ory  controllers  rated  at  10.6  GiBps  and  a  local  subset  of  system  memory.  All  four  dies  are 
connected  with  10.4  GiBps  HyperTransport  (HT)  links  in  each  direction,  forming  a  complete 
graph.  All  HT  links  are  16-bit  wide  except  for  two  8-bit  wide  diagonal  crosslinks  connecting 
domains 0-3 and 1-2. Figure 1.1 illustrates this design. Four memory and processor domains
are available, each with 8 GiB of memory. Our study of this processor is significant
as it forms the basis of the upcoming Los Alamos/Sandia National Labs “Cielo” capability
supercomputer.

Results from current approaches were obtained on this system running RHEL 5.4 with a
vanilla 2.6.27 kernel patched with support for the next-touch [5] memory policy.

2.2.2.  STREAM.  STREAM  [10]  is  a  memory  bandwidth  micro-benchmark  parallelized 
with  OpenMP.  Each  thread  executes  multiple  kernels  over  its  subset  of  data  within  three 
global  arrays.  Two  kernels  are  represented  in  our  data,  “copy”  and  “triad”  illustrated  in  figure 
2.3. STREAM was configured to use 333 MiB-sized arrays to minimize caching effects (our
test system has 20 MiB of effective last-level cache²). Additionally, array elements are only
accessed within parallel regions to create high affinities between threads and their data.

Fig. 2.3. STREAM micro-kernels.


² While the system has 24 MiB of L3 cache, the HT Assist optimization uses 4 MiB of L3 to store directory
information.

Fig. 2.4. STREAM per-thread data locality (entire execution).


Each of the three arrays is visible in figure 2.4, clearly indicating the regular access
patterns  of  all  threads.  A  fourth  line  towards  the  end  of  its  virtual  memory  space  represents 
the  result  vector  storing  timing  measurements.  STREAM  exhibits  this  behavior  throughout 
its  entire  execution. 

2.2.3.  HPCCG.  HPCCG  [6]  is  a  sparse-matrix  vector  multiply  application  with  parallel 
implementations  in  both  MPI  and  OpenMP.  HPCCG  allocates  multiple  arrays  of  data  before 
entering  repeated  iterations  of  its  parallel  sections  where,  like  STREAM,  each  thread  per¬ 
forms  work  on  a  subset  of  data.  Two  phases  have  been  identified  in  this  micro-benchmark, 
both  visualized  in  figure  2.5  for  the  OpenMP  implementation. 

Figure  2.5(a)  depicts  the  master  thread  allocating  and  initializing  all  data.  Figure  2.5(b) 
depicts the second phase of HPCCG’s execution, wherein multiple parallel sections perform
work  on  subsets  of  the  overall  data.  The  majority  of  work  performed  by  this  micro-benchmark 
is  within  this  phase.  Affinities  between  threads  and  data  are  strong  but  different  in  the  two 
phases. 

2.3.  Techniques.  In  this  subsection  we  discuss  current  approaches  to  data  distribution 
at  or  below  the  operating  system  level  and  investigate  their  advantages  and  disadvantages  as 
they  apply  to  application  phases  seen  in  STREAM  and  HPCCG. 

2.3.1.  First-Touch.  This  is  the  default  policy  in  the  Linux  kernel.  It  is  not  a  solution  to 
the  NUMA  problem  per  se,  but  rather  a  means  to  enable  memory  allocated  by  applications 
to  be  local  to  the  domain  in  which  they  are  executing.  The  Linux  kernel  memory  allocator 
maintains  a  pool  of  memory  for  each  domain  in  the  system.  This  policy  has  no  means  to 
assist  applications  exhibiting  strong  phase  changes,  as  data  is  never  moved  once  allocated. 
Applications  must  be  designed  accordingly  to  minimize  remote  memory  accesses,  requiring 
that  all  data  be  touched  first  by  the  thread  using  it  the  most. 
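
The consequence of first-touch for an application with a serial initialization phase can be illustrated with a small model. This is a sketch under stated assumptions rather than a description of the Linux allocator: one pinned thread per domain, pages identified by integers, and the helper names are hypothetical.

```python
def first_touch_placement(touches, thread_domain):
    """Map each page to the domain of the first thread that touches it.

    touches: iterable of (thread, page) pairs in program order.
    thread_domain: dict mapping thread id -> NUMA domain it runs on.
    """
    placement = {}
    for thread, page in touches:
        # setdefault keeps the first assignment: pages are never moved.
        placement.setdefault(page, thread_domain[thread])
    return placement


def local_fraction(accesses, placement, thread_domain):
    """Fraction of accesses whose page lives in the accessing thread's domain."""
    accesses = list(accesses)
    local = sum(placement[p] == thread_domain[t] for t, p in accesses)
    return local / len(accesses)


DOMAINS, PAGES = 4, 16
thread_domain = {t: t for t in range(DOMAINS)}             # one pinned thread per domain
work = [(p * DOMAINS // PAGES, p) for p in range(PAGES)]   # thread t owns one quarter

# Serial initialization: thread 0 touches every page first.
serial = first_touch_placement([(0, p) for p in range(PAGES)], thread_domain)
# Parallel initialization: each thread touches its own pages first.
parallel = first_touch_placement(work, thread_domain)

print(local_fraction(work, serial, thread_domain))    # 0.25
print(local_fraction(work, parallel, thread_domain))  # 1.0
```

With serial initialization only the thread that shares a domain with the initializing thread sees local memory, which is the situation an application must be restructured to avoid under this policy.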

Fig. 2.5. HPCCG per-thread data locality: (a) Phase 1; (b) Phase 2.

2.3.2. Memory Interleaving. Memory interleaving allows for an even distribution of
memory over NUMA domains at page or cache line granularity. Two configurations
were available to us for experimentation and were simple to configure; these are presented
in figure 2.6. Figure 2.6(a) illustrates operating system control over an application’s virtual
memory pages. The kernel memory allocator maintains pools of memory for each domain,
cycling among them when creating translations from the virtual memory space of an application.
Accessing virtual memory linearly therefore iterates physically over all available NUMA
domains at page granularity. Figure 2.6(b) illustrates a second method of interleaving. Here
the memory controllers are configured by the BIOS to modify their mapping of physical addresses
to machine memory, interleaving them among the NUMA domains. In contrast to page interleaving, this
method operates at the granularity of a cache line, is transparent to the operating system and
allows for a finer distribution of memory among the domains (interleaving the kernel address
space as well as all application address spaces). Both methods distribute memory such that
the chance of accessing either a page or cache line locally is 1/D for D domains.
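
Both interleaving schemes can be approximated by a modulo mapping from an address to a domain, differing only in granularity. The sketch below assumes a 4 KiB page, a 64-byte cache line and a strict round-robin assignment; actual allocators and memory controllers may use a different mapping, so the functions are illustrative only.

```python
PAGE_SIZE = 4096   # bytes; assumed base page size
CACHE_LINE = 64    # bytes; assumed cache line size


def page_interleaved_domain(vaddr, domains):
    """Domain holding a virtual address when pages are dealt round-robin."""
    return (vaddr // PAGE_SIZE) % domains


def line_interleaved_domain(paddr, domains):
    """Domain holding a physical address under cache line interleaving."""
    return (paddr // CACHE_LINE) % domains


def local_share(domain_of, my_domain, domains, span):
    """Fraction of the cache lines in [0, span) that land in my_domain."""
    lines = range(0, span, CACHE_LINE)
    hits = sum(domain_of(addr, domains) == my_domain for addr in lines)
    return hits / len(lines)


span = 16 * PAGE_SIZE
print(local_share(page_interleaved_domain, 0, 4, span))  # 0.25
print(local_share(line_interleaved_domain, 0, 4, span))  # 0.25
```

Under either granularity a thread in one of four domains finds one quarter of a linearly accessed region local; what differs is how finely consecutive accesses are spread over the memory controllers.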

Page  interleaving  support  is  provided  by  the  numactl  command  line  tool  in  Linux.  It 
allows for various NUMA-related operations on processes, such as memory and domain execution
pinning, CPU core pinning and memory interleaving.

Fig. 2.6. Two levels and granularities of memory interleaving: (a) operating system page interleaving of VM; (b) BIOS cache line interleaving of PM.


2.3.3.  Next-Touch.  Next-touch  is  a  memory  policy  implemented  in  kernel  space.  In 
this  scheme,  an  application  flags  regions  of  memory  that  should  be  migrated  to  the  NUMA 
domain  of  the  next  thread  that  accesses  the  region.  By  manipulating  protection  bits  in  the 
page  table  we  force  the  hardware  to  intercept  memory  accesses,  migrating  their  pages  before 
resuming  process  execution.  Recent  work  [5]  provides  this  implementation  as  a  patch  for  the 
2.6.27 Linux kernel. The patch adds additional flags to the madvise() system call, enabling
user-space  applications  to  inform  the  kernel’s  memory  subsystem  to  modify  the  appropriate 
protection  bits  on  a  given  range  of  pages.  On  subsequent  touches  the  hardware  traps  the 
executing  process  and  migrates  pages  to  whichever  domain  the  process  is  executing  on. 
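
The mechanism can be modeled in a few lines. The class below is a toy stand-in, not the kernel patch: `mark_next_touch` plays the role of the madvise() call, and the fault handling is reduced to a set lookup. All names are hypothetical.

```python
class NextTouchMemory:
    """Toy model: pages have a home domain and an optional next-touch mark."""

    def __init__(self, placement):
        self.home = dict(placement)   # page -> current NUMA domain
        self.marked = set()           # pages flagged for migration
        self.migrations = 0

    def mark_next_touch(self, pages):
        """Stand-in for the madvise() call described in the text."""
        self.marked.update(pages)

    def access(self, page, domain):
        """Return True if the access is local once any migration is done."""
        if page in self.marked:       # models the protection fault
            self.marked.discard(page)
            if self.home[page] != domain:
                self.home[page] = domain
                self.migrations += 1
        return self.home[page] == domain


# All eight pages start in domain 0; two pages are then used per domain.
mem = NextTouchMemory({page: 0 for page in range(8)})
mem.mark_next_touch(range(8))
print(all(mem.access(page, page // 2) for page in range(8)))  # True
print(mem.migrations)                                         # 6
print(mem.access(7, 0))                                       # False
```

The last line shows the one-shot nature of the policy: once a page has migrated, a later access from a different domain is remote until the application flags the region again.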

Static code analysis enables the programmer to identify when page migrations are needed.
One difficulty with this approach is ensuring that, after a region is flagged, the first thread
to touch each page is the one that will use it most heavily. This assumes the appropriate place to call
madvise() can be determined through static analysis. Furthermore, the data access pattern
among threads may change throughout an application’s lifetime. For this approach to be
effective each application phase must be identified and a call to madvise() inserted.

Support  for  this  feature  exists  in  the  Oracle  Solaris  9  operating  system  [13],  but  currently 
has  not  been  accepted  into  the  mainstream  Linux  kernel  codebase. 

2.4.  Results.  In  this  subsection  we  present  and  compare  performance  results  for  the 
next-touch,  first-touch  and  memory  interleaving  strategies  with  respect  to  execution  phases 
seen in STREAM and HPCCG. Results are shown in figures 2.7 and 2.8. Points plotted in
each  of  the  three  graphs  comprise  an  average  of  20  executions  with  error  bars  illustrating  one 
standard  deviation.  We  used  two  strategies  for  pinning  threads,  round-robin  (“Pin  RR”)  and 
ascending  (“Pin  Asc”),  both  tied  to  the  Linux-logical  core  IDs. 

Fig. 2.7. Performance of STREAM applying current data distribution techniques.

STREAM performance data was collected from a combination of operating system thread
scheduling and manual thread pinning in addition to the first-touch and page interleave strategies.
Red curves in figures 2.7(a) and 2.7(b) show the performance of STREAM with no
modifications and no runtime policies. Variability is high due to the non-deterministic behavior
of the 2.6.27 Linux scheduler and its inability to identify affinities between data and
threads. It may repeatedly schedule multiple threads for execution on the same domain before
attempting to reduce oversubscription. This behavior causes threads to access their data
from various domains, causing scattered first-touch allocations to occur. If the scheduler
knew not to relocate threads, first-touch would show the best performance. Forcing
thread pinnings reproduces this behavior and indeed shows the best performance, represented
by black curves in both plots.³ Interleaving pages reduces performance considerably, as the
probability of accessing local data is reduced to 25% from the near 100% achieved with
thread pinning and first-touch.⁴ Pinning threads in ascending order by Linux-logical core IDs
shows a plateau (gray curve) as the first twelve core assignments alternate between domains 0
and 3, the latter twelve between domains 1 and 2. This demonstrates that assumptions cannot
be made regarding the logical-to-physical mapping of cores by the operating system. Next-touch
results were not included as first-touch and thread pinning give the same result. Curves
in figure 2.7(b) are more linear due to a larger percentage of work coming from computational
overhead. Overall trends are, however, still visible.
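
The plateau under ascending pinning follows from the core-to-domain mapping reported above. The sketch below encodes one possible reading of that mapping, assuming that "alternate" means even logical core IDs fall in the first domain of each pair and odd IDs in the second; the exact assignment on the test system may differ.

```python
def core_to_domain(core):
    """Logical core ID -> NUMA domain, following the mapping reported above.

    Assumption: within each group of twelve cores, even IDs map to the
    first domain of the pair and odd IDs to the second.
    """
    pair = (0, 3) if core < 12 else (1, 2)
    return pair[core % 2]


def domains_in_use(threads):
    """Domains occupied when threads are pinned to cores 0..threads-1."""
    return sorted({core_to_domain(core) for core in range(threads)})


for n in (1, 2, 12, 13, 14, 24):
    print(n, domains_in_use(n))
```

Under this reading, runs with 2 to 12 threads occupy only domains 0 and 3, so memory from only two of the four domains is exercised until the thirteenth thread is added.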


Fig. 2.8. Performance of HPCCG applying current data distribution techniques.


HPCCG performance data was collected from the same policy combinations used for
STREAM, with the addition of the cache line interleaving technique and results from the MPI
implementation. The red curve in figure 2.8 uses the same policy as the red curve in figure 2.7,
with no code modifications or runtime policies. The behavior of this curve is attributed to the changes
in the data access patterns mentioned in section 2.2.3: the existence of phase 1 forces the
first-touch policy to allocate all memory pages on a single memory domain. Transitioning to
phase 2 shows an increase in threads, all of which are relocated by the scheduler to CPU
cores in other domains to minimize oversubscription. In doing so memory references for
75% of all threads become almost entirely non-local. In contrast, the MPI implementation
scales much better due to data duplication, yet performance tapers off at high thread counts,
showing large variations most likely attributable again to the nondeterministic behavior of the
Linux scheduler. Without data distribution in a threaded environment, first-touch prevents
applications with phases similar to HPCCG from scaling on NUMA hardware.

³ We do not yet know the cause of the saw-tooth shape in the curve.

⁴ A bug was discovered in the PGI OpenMP runtime library, preventing thread pinning while interleaving pages.
Unfortunately this prevents accurate comparisons as we cannot eliminate the scheduler’s nondeterministic behavior.
This bug was reported to PGI and promptly fixed; the fix will be available in the next release of the PGI compiler suite.

Having  identified  both  phases  of  HPCCG,  we  were  able  to  effectively  use  next-touch  by 
placing appropriate calls to madvise() at the end of phase 1. Combined with thread pinning
this  method  showed  the  best  overall  performance  as  memory  pages  were  migrated  to  domains 
in  which  the  accessing  threads  were  executing,  enabling  nearly  all  memory  references  to  be 
local.  Page  and  cache  line  interleaving  also  improve  performance  as  expected,  with  cache 
line  interleaving  showing  slightly  higher  performance  due  to  the  smaller  granularity. 

3.  Future  Work.  An  application’s  data  access  patterns  change  throughout  execution. 
Approaches  discussed  in  prior  sections  function  in  a  one-shot  manner,  continuously  apply  the 
same  rule  or  require  user  interaction.  We  therefore  propose  a  dynamic  runtime  system  for 
monitoring  an  application’s  memory  access  patterns  from  within  the  kernel.  Active  moni¬ 
toring  allows  the  operating  system  to  observe  affinities  between  an  application’s  threads  and 
pages  in  memory,  migrating  pages  to  reduce  remote  memory  references. 


Fig. 3.1. Proposed dynamic runtime system for page migration within the kernel: (a) multiple page tables for monitoring page access rates; (b) effect of runtime profiling on an application.


An operating system typically allocates one page table per process. By using one page
table per domain per process we will be able to capture where accesses originate and what
regions of memory they reference, as seen in figure 3.1(a). The appropriate page table is installed
on a context switch, and access bits within page table entries updated by the hardware
will  be  read  to  monitor  an  application’s  data  access  patterns.  A  kernel  thread  will  periodi¬ 
cally  scan  these  entries  to  observe  page  access  patterns,  identify  frequent  accesses  of  non¬ 
local  memory  and  migrate  pages  accordingly.  Frequent  scanning  will  be  required  as  only 
one  access  bit  is  available  per  page,  potentially  increasing  profiling  overhead.  To  prove  itself 
advantageous,  our  approach  must  ensure  the  savings  in  execution  time  of  the  application  are 
greater  than  the  combined  cost  of  data  migration  and  active  profiling — figure  3.1(b).  We  ar¬ 
gue  that  this  approach  will  allow  for  a  more  flexible  and  customizable  solution  to  the  problem 
of  data  distribution  on  NUMA  hardware. 
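
The proposed mechanism can be sketched as a user-level model. Everything below is hypothetical: the class, the use of a set to stand in for hardware access bits, and the `min_scans` migration rule are illustrative choices, since the policies themselves are left to future work.

```python
from collections import Counter


class DomainPageTables:
    """One set of access bits per NUMA domain for a single process."""

    def __init__(self, domains, placement):
        self.home = dict(placement)                      # page -> domain
        self.bits = [set() for _ in range(domains)]      # pages with bit set
        self.counts = {p: Counter() for p in placement}  # scan history

    def touch(self, page, domain):
        """Hardware side: set the access bit in that domain's page table."""
        self.bits[domain].add(page)

    def scan(self):
        """Kernel thread: harvest and clear every access bit."""
        for domain, accessed in enumerate(self.bits):
            for page in accessed:
                self.counts[page][domain] += 1
            accessed.clear()

    def migrate(self, min_scans=2):
        """Move pages seen mostly from a remote domain; return pages moved."""
        moved = []
        for page, seen in self.counts.items():
            if not seen:
                continue
            domain, hits = seen.most_common(1)[0]
            if domain != self.home[page] and hits >= min_scans:
                self.home[page] = domain
                moved.append(page)
        return moved


def migration_pays_off(time_saved, migration_cost, profiling_cost):
    """The condition illustrated in figure 3.1(b)."""
    return time_saved > migration_cost + profiling_cost


tables = DomainPageTables(4, {page: 0 for page in range(8)})
for _ in range(3):                        # three scan periods
    for page in range(8):
        tables.touch(page, page // 2)     # two pages used per domain
    tables.scan()
print(tables.migrate())  # [2, 3, 4, 5, 6, 7]
```

Because each scan records at most one observation per page per domain, the counts reflect how many scan periods saw an access, not how many accesses occurred; this is why the scan frequency matters.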

Overheads introduced by this mechanism can be reduced if implemented in a lightweight
kernel such as Kitten [7]. Kitten creates a complete linear mapping of all virtual memory
addresses on process creation, in other words, pre-populating all page tables. Knowing
the  locations  of  all  last-level  page  table  entries  can  reduce  the  execution  overhead  introduced 
from  frequent  access  bit  scanning  by  avoiding  full  page  table  traversals.  Kitten  furthermore 
supports  the  use  of  large  pages  to  map  virtual  memory.  Enabling  this  will  reduce  the  height 
of  each  page  table  and  lower  the  number  of  page  table  entries  needed.  Using  our  proposed 
approach, each process on our four-domain system will consume four times the amount of memory needed to store its
address translations; large pages will reduce this footprint.
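
A rough estimate of this footprint can be worked out as follows. The calculation assumes x86-64 page table entries of 8 bytes, counts only last-level entries (upper levels add comparatively little), and uses 8 GiB of mapped memory purely as an example size.

```python
ENTRY_BYTES = 8  # size of one x86-64 page table entry


def leaf_table_bytes(mapped_bytes, page_bytes, replicas=1):
    """Memory used by last-level page table entries, per process."""
    entries = -(-mapped_bytes // page_bytes)  # ceiling division
    return entries * ENTRY_BYTES * replicas


GiB, MiB, KiB = 2**30, 2**20, 2**10
mapped = 8 * GiB  # example size only

for label, page in (("4 KiB pages", 4 * KiB), ("2 MiB pages", 2 * MiB)):
    single = leaf_table_bytes(mapped, page)
    quad = leaf_table_bytes(mapped, page, replicas=4)
    print(label, single // KiB, "KiB ->", quad // KiB, "KiB")
# 4 KiB pages 16384 KiB -> 65536 KiB
# 2 MiB pages 32 KiB -> 128 KiB
```

Under these assumptions, replicating the tables four times costs 64 MiB with 4 KiB pages but only 128 KiB with 2 MiB pages, and the number of entries a scanner must visit shrinks by the same factor of 512.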

Our approach is a mechanism that will require various policies to show any benefit.
The profiling frequency, the intervals for page migration and the definition
of “heavy” page access will all have to be tuned to determine which combinations give the
best results across a range of application phases. Research in the early nineties examined this
[8],  but  on  different  hardware  and  with  a  different  application  domain.  It  was  determined  that 
no  single  policy  proved  beneficial  across  all  kernels  in  their  benchmark  suite.  Our  goal  is  to 
re-examine  this  on  modern  hardware  across  a  more  appropriate  set  of  kernels. 

Future  support  in  hardware  may  include  widening  the  access  bit  field  in  each  page  table 
entry  and  modifying  the  processor  to  treat  this  field  as  a  saturating  counter  instead  of  a  bit 
flip.  While  beyond  the  scope  of  this  research,  it  would  reduce  the  execution  overhead  from 
profiling,  by  lowering  the  frequency  at  which  page  table  entries  need  to  be  scanned. 
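
The benefit of a wider field can be illustrated with a simple model in which a page is touched a fixed number of times between scans. The function and its parameters are hypothetical and ignore everything except saturation.

```python
def observed_accesses(accesses_per_scan, scans, counter_bits):
    """Accesses the kernel can account for after a number of scan periods.

    Each period the page is touched accesses_per_scan times; the field
    saturates at 2**counter_bits - 1 and is cleared by every scan.
    """
    cap = 2 ** counter_bits - 1
    return scans * min(accesses_per_scan, cap)


# A page touched 12 times per scan period, over 10 periods.
print(observed_accesses(12, 10, counter_bits=1))  # 10
print(observed_accesses(12, 10, counter_bits=4))  # 120
```

In this model a single access bit would need to be scanned twelve times as often to distinguish this page from one touched once per period, whereas a 4-bit saturating counter captures the difference at the original scan rate.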

4.  Conclusion.  In  this  paper  we  examined  several  existing  techniques  for  managing  data 
distribution  in  a  multicore  NUMA  environment,  the  basis  for  the  upcoming  “Cielo”  capability 
supercomputer.  As  scientific  applications  are  becoming  increasingly  hybrid,  incorporating 
threaded  programming  models  within  nodes,  support  for  data  distribution  becomes  a  limit 
on  intra-node  scalability.  We  demonstrated  the  static  nature  of  current  techniques,  requiring 
time-consuming  code  modifications  or  user  intervention.  With  tools  developed  to  visualize 
application  data  access  patterns  we  are  better  able  to  identify  phase  changes  in  thread-data 
affinities,  enabling  a  more  complete  understanding  of  our  application  domain.  Combined 
with  an  evaluation  of  our  proposed  kernel-level  dynamic  runtime  technique  we  demonstrate 
the  need  for  an  invisible  and  adaptive  data  distribution  model,  empowering  systems  software 
to continue to scale as we approach exascale computing.


INTEGRATING ROUTER POWER MODELS INTO THE STRUCTURAL
SIMULATION TOOLKIT

KEVIN D. THOMPSON¶, ARUN F. RODRIGUES‖, AND MING-YU HSIEH**

Abstract.  As  the  Structural  Simulation  Toolkit  (SST)  grows  to  encompass  a  large  assortment  of  hardware 
components,  the  need  to  model  power  consumption  becomes  very  clear.  Networking  hardware,  specifically  routers, 
consumes  a  very  measurable  amount  of  power.  SST  needs  an  interface  to  capture  this  data  and  report  it  to  the  user 
in a simple and friendly manner. We intend to integrate existing power modeling toolkits, specifically McPAT and
ORION 2.0, into SST’s router models in order to get a reasonably accurate idea of the power being consumed.

1. Introduction. The Structural Simulation Toolkit (SST) [4] is used to explore architectural
options when designing the next generation of supercomputers. Its highly modular
framework  allows  it  to  support  a  wide  range  of  customized  models  and  system  parameters. 
SST  allows  hardware  and  software  to  be  developed  and  modified  separately,  neither  limiting 
the  other.  It  then  links  the  programming  model  to  the  hardware  organization  and  relays  valu¬ 
able  performance  data  to  the  user  who  needs  to  design  a  very  powerful,  yet  efficient  and  cost 
effective,  system. 

Given that power is now the major limitation of computers [1], SST needs to have a way
to monitor the approximate power usage of a certain application being run on the proposed
architecture. The team here at Sandia is working on integrating accurate power modeling
into each main component of the toolkit. Interconnection networks have been estimated to
consume roughly 20-40% of a system’s total power budget [5]. We need a way to monitor and
limit  that  power  consumption.  When  considering  cost/performance  trade-offs,  power  models 
integrated  into  the  networking  components  of  SST  will  allow  the  developer  to  consider  how 
varying  the  topology,  port  count,  clock  rate,  etc.  would  affect  overall  consumption  of  power. 

We have identified two well-respected power modeling tools to use with SST: McPAT
(Multicore Power, Area, and Timing), a general purpose modeling framework from
HP Labs [3], and ORION 2.0 (Open Research Infrastructure for Optimizing Networks), a
networks-on-chip (NoC) specific power model suite from Princeton University [2]. In this
paper, we give some background on these power models and why they fit our needs
(Section 2). We then go over the design and integration of the power models into SST and
the network components (Section 3) and discuss the initial results obtained through simulations
thus far (Section 4). Finally, we address future work that still needs to be completed
before final production (Section 5).

2. Power Models. The two power models we integrated to measure network power
were McPAT [3] and ORION 2.0 [2]. Both of these power modeling tools have been written to
account for networks-on-chip (NoC) power. McPAT is a general power model that computes
power, area, and timing for multicore processors; NoC power modeling is only one fraction of
McPAT’s modeling capabilities. For this reason, McPAT is a very attractive power modeling
tool;  not  only  can  we  use  it  for  this  research,  but  it  will  be  (and  has  been)  used  to  model  power 
in  many  more  of  SST’s  components.  ORION,  on  the  other  hand,  is  designed  specifically  for 
power  analysis  of  NoC.  It  utilizes  many  low-level  customizable  parameters  and  we  anticipate 
it  providing  very  specific  and  accurate  results.  It  should  be  noted  that  McPAT  does  account 
for  short-circuit  power,  which  is  ignored  by  ORION.  We  may  try  to  add  some  short-circuit 
power  modeling  capabilities  to  ORION  in  the  future. 
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2.1.  McPAT.  McPAT  has  been  used  as  the  primary  power  modeling  toolkit  for  SST. 
McPAT  offers  accurate  multicore  and  manycore  modeling  and  can  “accurately  scale  circuit 
models  into  deep-submicron  technologies.” [3]  McPAT  models  dynamic,  static,  and  short- 
circuit  power  to  fully  encompass  all  sources  of  power  dissipation.  It  has  an  inherent  XML- 
based  interface  which  makes  it  very  easy  to  parse  the  high-  and  low-level  parameters  specified 
in  each  component’s  XML  file.  McPAT ’s  framework  is  presented  as  a  block  diagram  shown 
in  Figure  2.1.  The  data  is  input  through  an  XML  interface  which  McPAT ’s  core  reads  and 
analyzes.  Timing,  area,  and  power  data  are  then  output  to  the  user.  This  data  can  even 
be  output  in  real-time  so  that  a  sophisticated  simulator  can  make  real-time  adjustments  as 
necessary.  In  this  paper,  we  are  only  concerned  with  the  power  data  output.  As  seen  in 
the  diagram,  power  dissipation  is  broken  into  its  three  main  sources:  dynamic,  leakage,  and 
short-circuit  power.  This  separation  helps  us  to  determine  what  parameters  we  should  change 
to  optimize  power  consumption.  In  Section  4,  we  will  present  some  sample  output  data. 
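
For orientation, the three-way split can be written with the usual first-order CMOS approximations. These are textbook expressions rather than formulas taken from McPAT’s documentation, and the symbols (activity factor \alpha, switched capacitance C, supply voltage V_{dd}, clock frequency f, leakage current I_{leak}) are notation assumed here.

```latex
P_{\text{total}} = P_{\text{dynamic}} + P_{\text{short-circuit}} + P_{\text{leakage}},
\qquad
P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f,
\qquad
P_{\text{leakage}} \approx I_{\text{leak}} \, V_{dd}
```

Under these approximations, parameters such as clock rate and supply voltage act mainly on the dynamic term, while leakage is paid whether or not the router is being accessed, which is one reason a per-source breakdown is useful when choosing what to tune.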


Fig.  2.1.  Block  diagram  of  the  McPAT  framework.  [3] 


McPAT breaks the router model down into its main components, chiefly flit buffers,
arbiters, and crossbars. As seen in Figure 2.2, McPAT uses a traditional router with a
four-stage pipeline: routing computation, virtual channel allocation, switch allocation,
and switch traversal. The parameters defined by the user determine the size and number of
the router’s main components. SST signals McPAT every time the router is accessed. Using
the predefined parameters, McPAT then calculates the power used by each hardware
component. A more detailed discussion of McPAT’s power analysis, including formulas and
figures, can be found in its reference manual [3]. McPAT was validated against the
Niagara, Niagara2, Alpha 21364, and Xeon Tulsa processors. The difference between its
models and the reported data was between 10% and 20%.

2.2. ORION. Unlike McPAT, ORION had not previously been incorporated into SST. We
identified it as a good “second opinion” power modeling tool to use with SST’s networking
components. ORION has been shown to work with many-core chips, such as Intel’s 80-core
Teraflops chip and their Scalable Communications Core (SCC) chip. With so many cores on a
single chip, interconnect networks clearly play a major role in future systems. The other
interesting aspect of ORION is that its developers have included “a semi-automated flow
that automatically updates its models as new technology files become available” [2]. This
feature allows ORION to stay up to date with minimal labor for the user. Since SST’s main
role will be to facilitate the design of future supercomputers, we need power modeling
tools that stay current. Figure 2.3 shows ORION’s modeling framework. Like McPAT’s
framework in Figure 2.1, it outputs area and dynamic and leakage power (note that it does
not output timing or short-circuit power). ORION was validated to within 7% and 11% error
of the Intel 80-core and SCC chips, respectively.


Fig. 2.2. McPAT’s router model. [3]


Fig. 2.3. ORION 2.0 modeling methodology. [2]


3. Design and Integration. Since McPAT had previously been integrated for other
purposes, our main design tasks focused on ORION and the router model within SST. We
briefly discuss how McPAT was customized, as well as the process required to get ORION
working with SST. We then go over the additions that let the router model call and use
these libraries.

3.1. Integrating McPAT. McPAT’s source files were modified so that it can communicate
effectively with SST. Functions were added in the format SSTreturnXXX(), where XXX is the
name of the subcomponent in McPAT’s core. These functions are responsible for returning
McPAT component objects with unit energy to SST. McPAT was also modified to get its data
from the XML file of the component being monitored as well as from the original McPAT XML
file, which keeps default parameter values and instructions. A power API within SST manages
all interactions between the component being monitored and the power library being used;
it is the middleman controlling the transaction.

3.2. Integrating ORION. ORION had not previously been integrated into SST, so we had to
adapt it from scratch. It is written in C, while McPAT and SST are mainly coded in C++.
Several functions keep track of the power used by the router model. The first to be called
is SST_SIM_router_init(), which sets up the parameters in ORION based on a combination of
user input and default values. Next, there are two avenues to take. The first is simply to
run SST_SIM_router_stat_power(), which computes, from the previously set parameters, the
average power the model is estimated to use. The second uses four functions:
SIM_buf_power_data_read(), SIM_buf_power_data_write(), SIM_crossbar_record(), and
SIM_arbiter_record(). Every time a buffer read/write, crossbar activity, or arbiter
activity occurs in the model, the corresponding function is called and power data is
accumulated.
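
The second, activity-driven avenue amounts to adding up a per-access energy each time a buffer, crossbar, or arbiter is used. The C++ sketch below illustrates only that bookkeeping. It is not ORION’s code: the names UnitEnergy and RouterEnergyMeter are hypothetical, and the per-access energies are assumed to be supplied by the caller rather than derived from router parameters.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-access energies in joules. In a real integration these
// would be produced by the power library after initialization.
struct UnitEnergy {
    double buf_read;
    double buf_write;
    double crossbar;
    double arbiter;
};

// Accumulates dynamic energy one router activity at a time.
class RouterEnergyMeter {
public:
    explicit RouterEnergyMeter(const UnitEnergy& e) : unit_(e) {}

    void record_buf_read()  { energy_ += unit_.buf_read;  ++events_; }
    void record_buf_write() { energy_ += unit_.buf_write; ++events_; }
    void record_crossbar()  { energy_ += unit_.crossbar;  ++events_; }
    void record_arbiter()   { energy_ += unit_.arbiter;   ++events_; }

    double energy_joules() const { return energy_; }
    std::uint64_t events() const { return events_; }

    // Average dynamic power over a simulated interval, in watts.
    double average_power_watts(double elapsed_seconds) const {
        assert(elapsed_seconds > 0.0);
        return energy_ / elapsed_seconds;
    }

private:
    UnitEnergy unit_;
    double energy_ = 0.0;
    std::uint64_t events_ = 0;
};
```

A router model would call one record function per activity and divide by elapsed simulated time at the end; the first avenue skips the per-event calls and estimates an average directly from the configured parameters.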

3.3. Implementation in the Router Model. Now that the power libraries have been set up,
we need to link the router model with SST’s power API to drive the calculations. SST uses
an XML file generated by the router code to run the simulation. We modified the generation
of this file to include the parameters McPAT and ORION require. We set up a test router
configured with the initial parameters outlined in Table 3.1. Only the most critical
parameters are shown; many more can be customized. We vary these parameters to obtain the
graphs presented in Section 4. Finally, the router model needs to communicate its settings
at run time to the power libraries. This is accomplished by sending the appropriate data
to the power API, which forwards it to McPAT or ORION. We briefly discuss the critical
lines of code and their operation.


Table 3.1
Test-router parameters used.

    Parameter                    Value
    clock rate                   1.0 GHz
    supply voltage               1.2 V
    topology                     ring
    input ports                  8
    output ports                 8
    flit bits                    128
    virtual channels per port    2

The code shown in Figure 3.1 is called during every router event. The function
resetCounts() resets all of the counts to zero to ensure accurate accumulation. The
integer mycounts.router_access records how many times the router is accessed. In our
model, it is accessed twice: once to receive a flit, and once to send it on to its
destination. getPower() sends the data to either McPAT or ORION (whichever is specified in
the XML file), where it is analyzed, power is computed, and data is returned to SST.
Finally, regPowerStats() records the data returned. The total power, leakage power, and
dynamic power are accumulated at every call until they are reported to the user at the
termination of the simulation.


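power->resetCounts(&mycounts);                     // zero all usage counters
mycounts.router_access = 2;                        // one access to receive a flit, one to send it

pdata = power->getPower(this, ROUTER, mycounts);   // forwarded to McPAT or ORION
regPowerStats(pdata);                              // accumulate the returned power data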


Fig.  3.1.  Sample  calls  from  the  router  model. 
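
The calls in Figure 3.1 follow a reset, count, query, record pattern with the power API acting as a dispatcher. The sketch below mimics that pattern in self-contained C++ so the flow can be traced end to end. Everything in it is a stand-in: the types UsageCounts, PowerData, PowerModel, LinearModel, PowerApi, and PowerStats are hypothetical, the signatures are simplified relative to Figure 3.1, and the linear toy model with its coefficients is an assumption, not how McPAT or ORION compute power.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Illustrative stand-ins; not SST's actual declarations.
struct UsageCounts { unsigned router_access = 0; };
struct PowerData   { double leakage_w = 0.0; double runtime_w = 0.0; };

// Common interface a back end such as McPAT or ORION could sit behind.
class PowerModel {
public:
    virtual ~PowerModel() = default;
    virtual PowerData estimate(const UsageCounts& c) const = 0;
};

// Toy back end: fixed leakage plus a per-access runtime cost.
class LinearModel : public PowerModel {
public:
    LinearModel(double leakage_w, double per_access_w)
        : leakage_w_(leakage_w), per_access_w_(per_access_w) {}
    PowerData estimate(const UsageCounts& c) const override {
        return PowerData{leakage_w_, per_access_w_ * c.router_access};
    }
private:
    double leakage_w_;
    double per_access_w_;
};

// Middle layer: selects a back end by name and forwards counts to it.
class PowerApi {
public:
    explicit PowerApi(const std::string& library) {
        // Placeholder coefficients, chosen only so the two back ends differ.
        if (library == "McPAT") model_ = std::make_unique<LinearModel>(0.5, 0.25);
        else                    model_ = std::make_unique<LinearModel>(0.25, 0.5);
    }
    void resetCounts(UsageCounts* c) const {
        assert(c != nullptr);
        *c = UsageCounts{};
    }
    PowerData getPower(const UsageCounts& c) const { return model_->estimate(c); }
private:
    std::unique_ptr<PowerModel> model_;
};

// Running totals, reported when the simulation terminates.
struct PowerStats {
    double leakage_w = 0.0;
    double runtime_w = 0.0;
    void reg(const PowerData& p) { leakage_w += p.leakage_w; runtime_w += p.runtime_w; }
    double total_w() const { return leakage_w + runtime_w; }
};
```

Keeping the back end behind one interface is what allows the library to be chosen from a configuration file without touching the router model’s call sites.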

4. Results. Since our model integration has not yet been validated, we simply present
some sample output as it would be seen and applied by the user. We run each simulation
varying the number of ports and the flit width to demonstrate how power consumption
changes with critical parameters. Except for the parameter being varied, all parameters
are held constant at the values listed in Table 3.1.

4.1. Sample output. We ran SST’s router model using the parameters defined in Table 3.1
with McPAT’s power libraries. The results are shown in Figure 4.1. Our McPAT simulation
breaks the total power down into leakage and runtime power.


current total power = 1.19377 ± 5.96883e-06 W
leakage power = 0.479309030943142 W
runtime power = 0.714458 ± 3.57229e-06 W


Fig. 4.1. Router power dissipation output by McPAT.


We also ran the router model with the same parameters using ORION’s power libraries.
The results are shown in Figure 4.2. ORION breaks its power dissipation down into five
subcomponents so that the user has a better idea of where the most power is being consumed
and which design changes would be most worthwhile.


Buffer: 0.195698 W
Crossbar: 0.730277 W
VC_allocator: 0.00366383 W
SW_allocator: 0.0119834 W
Clock: 0.0932492 W
Total router power: 1.03487 W


Fig.  4.2.  Router  power  dissipation  output  by  ORION. 
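
As a quick consistency check on the sample output, the short C++ sketch below sums the five ORION components from Figure 4.2, computes each component’s share, and compares the two reported totals. The helper names are hypothetical; the numbers are copied from Figures 4.1 and 4.2.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Sum of per-component power figures, in watts.
inline double total_power(const std::vector<double>& parts_w) {
    return std::accumulate(parts_w.begin(), parts_w.end(), 0.0);
}

// Fraction of the total contributed by one component.
inline double share(double part_w, double total_w) {
    assert(total_w > 0.0);
    return part_w / total_w;
}

// Difference between a and b, relative to b.
inline double relative_difference(double a, double b) {
    assert(b != 0.0);
    return std::fabs(a - b) / std::fabs(b);
}

// Values from Figure 4.2: buffer, crossbar, VC allocator, SW allocator, clock.
inline std::vector<double> orion_components() {
    return {0.195698, 0.730277, 0.00366383, 0.0119834, 0.0932492};
}
```

With these inputs the five components add up to the reported 1.03487 W, the crossbar accounts for roughly 70% of that total, and the McPAT total of 1.19377 W from Figure 4.1 is roughly 15% above the ORION total for this one configuration.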


These results are each one of several data points used to produce the comparison graphs
in Figure 4.3. All parameters were held constant except for the number of ports and the
flit width. In graph 4.3(a), the flit width was held at 128 bits while the number of
input/output ports was varied from two to twelve. As the number of ports increases, power
dissipation grows at a roughly steady rate, so designers will want to keep the number of
ports and routers under control during the design process. SST allows a designer to
explore such trade-offs during development. For instance, increasing the number of ports
while decreasing the number of routers may ultimately result in less power consumption. In
graph 4.3(b), the input and output ports were held at eight apiece while the flit width
was varied from 8 bits to 256 bits. Growth is quite modest from 8 to 128 bits, but 256
bits shows a substantial leap in power dissipation. Designers thinking of using a flit
width over 64 bits need to consider carefully whether the performance boost would be worth
the large power increase. Valuable information can be gained by running these simple
simulations.

The discrepancy between McPAT’s and ORION’s power estimates is quite small, and both
grow at about the same rate. This is a good sign. By using two entirely different power
models and getting similar results, we are on our way to obtaining a good estimate of
power dissipation (this is, of course, dependent upon the accuracy of the parameters fed
to the models). Each has its advantage over the other. McPAT makes it easy to model more
than just network components: when simulating CPUs with NICs and routers, we can simply
call one power library and get an overall power dissipation report. On the other hand,
ORION does a good job of breaking down each of the main router components, and its
parameters are much more customizable than McPAT’s, for potentially greater accuracy.
When a detailed report of power dissipation in the router is needed, ORION will be the way
to go. It can always be used in parallel with McPAT as an accuracy check.


(b) Varying the flit width.


Fig. 4.3. ORION and McPAT power dissipation.

5. Future Work. The power modeling calls within the router model will likely need to be
further optimized, and validation will guide that work. Our results have not yet been
validated against a real router. Validation needs to be completed before the models can
be officially released in a public version of SST. The plan is to obtain router parameters
from real routers with published data. We can then run these parameters through our power
API and compare the output of our integrated power models with the published figures. If
everything was incorporated correctly, the results should be similar to those previously
published for McPAT and ORION in [3] and [2]. Another component closely related to the
router is the network interface controller (NIC) model. It is also responsible for a large
portion of the power consumed by the network and should be modeled accordingly. This area
may also include the modeling of inter-router links, which can consume up to 30% of total
NoC power [3]. Both McPAT and ORION are capable of modeling inter-router links. Link
lengths must be known, so we will need a good algorithm to calculate them before we can
model the links accurately.

6. Conclusions. This work has paved the way for more detailed power estimation.
Cost/performance graphs can now be obtained more easily to identify necessary trade-offs
during initial design. Soon, SST will be publicly available to both academia and industry,
and it is anticipated to be a useful and widely used tool in supercomputer design. Since
future many-core systems will rely heavily on interconnection networks, designers will
need a quick and easy way to determine their cost. This work has laid a foundation for
meeting those goals. Using McPAT and ORION 2.0, we can get a quick and accurate idea of
the power being consumed.


PROCESS LAYERS FOR DISCRETE EVENT SIMULATION OF COMPUTER SYSTEMS

CHAD  D.  KERSEY*  AND  ARUN  F.  RODRIGUES1 

Abstract. Programming environments for discrete event simulation often provide a
process-like abstraction for their users. A new class of parallel discrete event
simulation kernels created for the large-scale simulation of computer hardware provides
only a minimal set of primitives for interaction with the simulator. For certain problems,
the ability to use processes greatly improves programmer productivity and code
readability, at the cost of the lower performance and increased memory use imposed by the
process implementation. In this paper, we introduce two implementations of processes for
these frameworks, provide a case study to demonstrate the advantages and costs of using
processes, and propose methods for mitigating their costs. Similar software layers have
been in use for years to ease the programming of event-driven real-time systems and
network servers, and it is our goal to provide similar capabilities for the
high-performance simulation of computer systems.


1. Introduction. A multitude of purpose-built programming environments for discrete
event simulation have emerged over the years, typically featuring processes as a
fundamental abstraction. CSIM [12] and SimPy [13] exemplify pure process-oriented
simulation environments. In both of these systems, the basic abstraction presented to the
user is a collection of sequentially executing, communicating processes. Simulation time
passes only while these processes are blocked and stands still while they are running.
Common blocking primitives are waiting for a given amount of simulation time to elapse and
waiting for an event to be triggered by another process.

A new class of parallel simulation kernels created for the large-scale simulation of
computer systems, exemplified by the Manifold simulation kernel from Georgia Tech and the
Structural Simulation Toolkit (SST) simulation kernel from Sandia National Laboratories
[11], provides only a minimal set of primitives for interaction with the simulator,
leaving implementations of more expressive constructs to other software layers. Leaving
the implementation of processes out of the kernel not only simplifies the design of the
simulation kernel but also gives programmers the freedom to choose the appropriate tool
for their task. It is expected that multiple implementations of processes and other
high-level abstractions will be used simultaneously to build large-scale simulations.

2.  Background. 

2.1.  Discrete  Event  Simulation.  The  design  of  computer  systems  has  been  recognized 
as  an  appropriate  problem  domain  for  discrete  event  simulation  since  1974  at  the  latest  [10]. 
Through  the  use  of  accurate  simulations,  it  is  possible  to  demonstrate  effects  that  only  become 
apparent  at  the  system  level,  when  all  of  the  components  are  combined,  and  evaluate  solutions 
to  performance  limiting  behavior. 

A major challenge to the performance of the simulators themselves is implementing
correct and efficient parallel discrete event simulation. As discussed by Fujimoto in
[7], two classes of simulation methods exist: conservative approaches, limited by the
lookahead between parallel logical processes (which are distinct from the simulator
processes we implement here), and optimistic approaches, which detect causality
violations after they occur and roll back.

Broadly, discrete event simulators can be categorized as process-based or event-based.
A process in this sense is a component-level construct that is able to persist over a
span of simulation time. Event handlers, on the other hand, are executed entirely within
a single time step. While processes are running, no simulation time passes, but there are
a variety of blocking calls that they can make, during which simulation time can advance.
The two most common of these are waiting on a condition variable to be triggered by an
asynchronous event and delaying for a given period of simulation time.

These  two  blocking  operations  are  available  in  discrete  event  simulation  frameworks 
like  SimPy  and  CSIM  [13,  12],  and  also  in  the  widely  used  VHDL  [1]  hardware  description 
language.  VHDL  processes  work  the  same  way  as  SimPy  or  CSIM  processes,  but  are  not 
the  only  abstraction  provided  by  the  simulation  environment;  it  is  possible  to  build  detailed 
models  of  hardware  in  VHDL  without  using  processes.  Similarly,  the  software  layers  for 
implementing  processes  presented  here  are  available  in  addition  to,  not  instead  of,  the  ability 
to  define  individual  event  handlers. 

2.2. SST and Manifold. SST and Manifold both approach parallel discrete event simulation
conservatively, dividing the simulation into a set of composable components connected by
links. In SST, these components are automatically partitioned between parallel threads of
execution (MPI ranks), while in Manifold the partitioning must be performed manually by
the user. Each link is associated with a latency. This guarantees that the lookahead
between any two ranks, that is, the amount of simulation time that one rank can run ahead
of the other with no possibility of a causality violation, is the minimum of the link
latencies between them.
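
The lookahead rule can be stated in a few lines of code. The sketch below is not taken from SST or Manifold; the Link record and both function names are hypothetical, and it assumes every link has already been assigned a source and destination rank.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// A link between two components, each already assigned to a rank.
struct Link {
    int src_rank;
    int dst_rank;
    double latency;  // in simulation time units
};

// Lookahead between ranks a and b: the smallest latency of any link that
// crosses between them. Returns +infinity when no link joins the two ranks.
inline double lookahead(const std::vector<Link>& links, int a, int b) {
    double best = std::numeric_limits<double>::infinity();
    for (const Link& l : links) {
        assert(l.latency >= 0.0);
        const bool crosses = (l.src_rank == a && l.dst_rank == b) ||
                             (l.src_rank == b && l.dst_rank == a);
        if (crosses) best = std::min(best, l.latency);
    }
    return best;
}

// Latest simulation time a rank may safely advance to, given that its
// neighbor has completed all events up to neighbor_time.
inline double safe_horizon(double neighbor_time, double lookahead_value) {
    return neighbor_time + lookahead_value;
}
```

Links that stay within one rank do not constrain lookahead, which is why a partitioner benefits from cutting the component graph along its highest-latency links.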

The API for writing a component for either of these simulators is simply the
implementation of a set of event handlers and a constructor. Though both simulators also
provide a concept of a periodic clock in addition to events, this can be viewed as an
optimization of the event scheduling system for components written in a time-stepped
paradigm. While these simulation kernels do use processes in the sense of a single
logical process per MPI rank, there is no notion of a process from the point of view of
the component implementer.

SST, in development at Sandia National Laboratories [11], provides a parallel simulation
kernel along with a set of components, which can be combined through the use of an XML
system description language (SDL) into a full simulation. The component graph is
partitioned automatically across MPI ranks, which correspond to what Fujimoto referred to
as logical processes [7], in order to maximize lookahead.

So that the component graph has edges representing lookahead and can be partitioned to
maximize it, components in SST cannot arbitrarily schedule events in the simulator for
any other component. A link with a corresponding latency must be defined in the SDL file
between the two components. This latency represents the amount of simulation time that
elapses between an event being created and its arrival at the destination component.

The Manifold project at Georgia Tech is similar in its scope and goals to SST, but uses
slightly different semantics to simplify design. In Manifold, components within the same
rank may communicate by scheduling function calls on one another directly, without the use
of links. Links are necessary only for communication between ranks, with partitioning of
components among ranks performed manually.

Both Manifold and SST share an event-based programming model. Event handlers (which must
be configured as such in SST but can be any void function in Manifold) execute entirely
within a single simulation time step. If an event handler cannot proceed in its task until
a condition is satisfied, it must save its state in some known location from which the
task will be resumed when the condition is satisfied. If its task requires a delay for a
given period of simulation time, it must schedule a future event. Both of these situations
require splitting a logically continuous process among multiple events.

As a simple example, a Poisson process traffic generator written in an event-oriented
manner might be implemented in pseudocode as:

void event_a_handler() {
    if (running) {
        generate_packet();
        // Re-arm: schedule the next arrival after an exponentially distributed delay.
        schedule(exp_rn(MTBP), new event_a());
    }
}

constructor() {
    // Schedule the first arrival.
    schedule(exp_rn(MTBP), new event_a());
}

An identical traffic generator implemented with processes might look like:

process_main() {
    pause(exp_rn(MTBP));          // wait for the first arrival
    while (running) {
        generate_packet();
        pause(exp_rn(MTBP));      // block until the next arrival
    }
}

constructor() {
    spawn_process(process_main);
}

Notice  that  the  control  flow  structure  has  been  improved  by  allowing  the  use  of  a  while 
loop  for  a  repeated  action,  instead  of  requiring  that  the  new  events  be  manually  scheduled  as 
long  as  running  is  true. 
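
For readers who want to run the event-oriented version, the following is a minimal C++ sketch of it on top of a toy sequential event kernel. It is not the SST or Manifold API: Kernel, TrafficGenerator, and their members are hypothetical names, the kernel is single-threaded, and the mtbp parameter is assumed to be the mean of the exponentially distributed gap between packets.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Minimal sequential event kernel: a time-ordered queue of callbacks.
class Kernel {
public:
    using Handler = std::function<void()>;

    // Schedule h to run `delay` time units after the current time.
    void schedule(double delay, Handler h) {
        assert(delay >= 0.0);
        queue_.push(Event{now_ + delay, seq_++, std::move(h)});
    }

    // Run events in timestamp order until none remain at or before `end`.
    void run_until(double end) {
        while (!queue_.empty() && queue_.top().time <= end) {
            Event ev = queue_.top();
            queue_.pop();
            now_ = ev.time;   // time advances only between handlers
            ev.handler();
        }
    }

    double now() const { return now_; }

private:
    struct Event {
        double time;
        unsigned long seq;    // tie-breaker: same-time events run in FIFO order
        Handler handler;
        bool operator>(const Event& o) const {
            return time != o.time ? time > o.time : seq > o.seq;
        }
    };
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> queue_;
    double now_ = 0.0;
    unsigned long seq_ = 0;
};

// Event-oriented generator: each handler call emits one packet and re-arms.
class TrafficGenerator {
public:
    TrafficGenerator(Kernel& k, double mtbp, unsigned seed)
        : kernel_(k), gap_(1.0 / mtbp), rng_(seed) {
        kernel_.schedule(gap_(rng_), [this] { on_event(); });
    }
    void stop() { running_ = false; }
    unsigned packets() const { return packets_; }

private:
    void on_event() {
        if (!running_) return;
        ++packets_;           // stands in for generate_packet()
        kernel_.schedule(gap_(rng_), [this] { on_event(); });
    }
    Kernel& kernel_;
    std::exponential_distribution<double> gap_;
    std::mt19937 rng_;
    bool running_ = true;
    unsigned packets_ = 0;
};
```

The generator’s only persistent state is a flag and a counter, so the event form is manageable here; the difference from the process form grows with the amount of state that must survive between handler invocations.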

SST imposes an additional requirement on component writers: the components they create
must be serializable; that is, they must provide a way to save and restore their state to
a byte stream. This is important for the largest-scale simulations to avoid the consequences
of  hardware  failure  during  simulation.  This  requirement  has  far-reaching  consequences  in 
component  implementation  that  impact  the  implementation  of  process  layers  for  use  with 
SST. 
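
To illustrate what saving and restoring state to a byte stream involves, here is a small hand-rolled C++ sketch. It does not show SST’s serialization mechanism; GeneratorState and its fields are hypothetical, and the encoding simply copies each field in host byte order, which a real checkpoint format would need to pin down.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical component state that must survive a checkpoint.
struct GeneratorState {
    std::uint64_t packets_sent = 0;
    double next_event_time = 0.0;
    bool running = true;

    // Append this state to a byte stream, field by field.
    void save(std::vector<std::uint8_t>& out) const {
        append(out, packets_sent);
        append(out, next_event_time);
        append(out, static_cast<std::uint8_t>(running ? 1 : 0));
    }

    // Rebuild the state from bytes starting at pos; returns the new position.
    std::size_t restore(const std::vector<std::uint8_t>& in, std::size_t pos = 0) {
        pos = read(in, pos, packets_sent);
        pos = read(in, pos, next_event_time);
        std::uint8_t flag = 0;
        pos = read(in, pos, flag);
        running = (flag != 0);
        return pos;
    }

private:
    template <typename T>
    static void append(std::vector<std::uint8_t>& out, const T& v) {
        const auto* p = reinterpret_cast<const std::uint8_t*>(&v);
        out.insert(out.end(), p, p + sizeof(T));
    }

    template <typename T>
    static std::size_t read(const std::vector<std::uint8_t>& in,
                            std::size_t pos, T& v) {
        assert(pos + sizeof(T) <= in.size());
        std::memcpy(&v, in.data() + pos, sizeof(T));
        return pos + sizeof(T);
    }
};
```

Plain data like this is easy to checkpoint. A process layer has to do the same for each suspended process, which is harder when that process’s state lives on a native call stack rather than in named fields.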

2.3. Event-Driven Programming. The limited component implementation API of SST and
Manifold is similar to APIs seen in event-driven programming environments. Long used for
embedded systems, UI-driven systems, and high-performance servers, the event-driven
paradigm provides an efficient way for software to interact with asynchronous external
events, advancing as long as there are events left to process. A common complaint among
users of event-driven systems is that it is difficult to maintain state across a series
of events, for instance to maintain a session when implementing a network protocol. Many
solutions have been proposed to circumvent this while maintaining most of the efficiency
of events [5, 6, 9, 14].


2.4. QSim and Related Components. QSim is a library developed as a layer on top of a
modified QEMU CPU emulator [2], which adds to QEMU the ability to perform cycle-level
simulations based on CPU timing models. QSim is designed to be used along with simulation
frameworks like SST and Manifold to create a simulation environment capable of executing
Intel x86 and x86-64 benchmarks and operating systems.

The QSim implementation is independent of the timing model to the point that it does not
require any timing information beyond a periodic timer interrupt for the operating
system. The timing model sets execution callbacks within QSim and uses the information
they provide about the instruction stream to determine the advancement of execution within
QSim and the time within the simulator. QSim can be considered a full-system
re-implementation of Shade [3] for the manycore era with the addition of timing feedback.

As a front end with no concept of time, QSim relies entirely on a back-end simulation to
control the advancement of execution and provide meaningful results. In the development of
this back-end system, it has become clear that the simulation frameworks on which it is
built could benefit from an implementation of processes.

The  first  complex  component  being  developed  as  a  part  of  the  back  end  for  QSim  is  a 
model  of  a  non-blocking  cache  memory.  Combined  with  the  existing  simple  CPU  models 
and  an  implementation  of  a  coherence  protocol,  the  first  manycore  simulations  using  QSim 
will  be  realized.  During  the  initial  stages  of  the  implementation  of  this  component,  it  has 
become  apparent  that  implementation  complexity  and  code  readability  would  suffer  greatly 
if  processes  remain  unavailable.  A  read,  for  example,  must  be  split  over  at  least  fourteen 
functions,  or  fourteen  states  within  a  single  function,  to  handle  various  delays  and  corner 
cases.  With  processes,  this  same  read  can  be  represented  as  a  single  blocking  function  with 
ordinary  control  flow  constructs. 
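
The cost of the event-only style is easiest to see in miniature. The C++ sketch below models a cache read as an explicit state machine with three states rather than the fourteen or more mentioned above. It is purely illustrative: ReadState, ReadRequest, step, and run_read are hypothetical names, and the latencies are arbitrary parameters, not values from the QSim back end.

```cpp
#include <cassert>

// Illustrative only: far fewer states than a real non-blocking cache read.
enum class ReadState { Idle, TagLookup, WaitMemory, Done };

// State that must be saved between events because no call stack survives.
struct ReadRequest {
    ReadState state = ReadState::Idle;
    bool hit = false;
    int events_handled = 0;
};

// One event-handler invocation: do the work for the current state, record
// where to resume, and return the delay until the next event (0 when done).
inline int step(ReadRequest& r, int tag_latency, int mem_latency) {
    ++r.events_handled;
    switch (r.state) {
    case ReadState::Idle:
        r.state = ReadState::TagLookup;
        return tag_latency;               // wait for the tag array access
    case ReadState::TagLookup:
        if (r.hit) { r.state = ReadState::Done; return 0; }
        r.state = ReadState::WaitMemory;
        return mem_latency;               // miss: wait for the next level
    case ReadState::WaitMemory:
        r.state = ReadState::Done;
        return 0;
    case ReadState::Done:
        break;
    }
    assert(false && "step() called on a completed request");
    return 0;
}

// Stand-in for the simulator: drives the request and sums the delays.
inline int run_read(ReadRequest& r, int tag_latency, int mem_latency) {
    int elapsed = 0;
    while (r.state != ReadState::Done) elapsed += step(r, tag_latency, mem_latency);
    return elapsed;
}
```

Every delay or corner case adds another enumerator and another case label, and the control flow of the read is no longer visible in one place. A blocking version would instead read top to bottom, with each wait expressed as a single call.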

3.  Related  Work. 

3.1. Processes and Simulation. Processes have long been a standard feature of discrete
event simulators, and frameworks for discrete event simulation frequently feature the
process as a fundamental abstraction. Both SimPy and CSIM [13, 12] use an implementation
of coroutines to provide a process abstraction to their users. They are limited by the
fact that their users are forced to use a one-size-fits-all implementation of processes
and cannot work with events directly, even where doing so would be appropriate for
performance or implementation clarity.

Some frameworks have gone beyond the process abstraction and proposed semantics more
appropriate for their field. YetiSim [8] makes its fundamental abstraction the UML
execution graph instead of the process. Just like the earlier process-based simulation
frameworks, however, YetiSim suffers from the limitation that simulations written with it
cannot easily interact with simulations written with other paradigms.

What we have implemented is a layer that allows for hybrid simulations. Events can be
used directly where appropriate, and our process layers can be used within components to
provide the familiar process paradigm in addition to events. Processes can be created
from event handlers or other processes and run until they block; they can be caused to
advance by the setting of a condition variable from within an event handler or another
process; and they can communicate through messages sent on links with other components
that may themselves use either of the process implementations described here, events, or
something altogether different.

3.2. Cooperative Threads using Events. Quite a bit of work has been done on combining
the performance of event-driven systems with the programmability of threads in domains
where event-driven systems perform better but threads are more programmable. This work has
not yet spread to the realm of discrete event simulation, where processes have
traditionally been provided by the simulation kernel.

On the smallest scale, in the embedded domain, are operating systems designed for
resource-limited microcontrollers for use in applications like wireless sensor networks. The
popular small-scale embedded OS Contiki [4] provides a pure event-driven paradigm to
programmers. Protothreads [5] wraps Contiki with the most basic of stackless threadlike
structures. Though they do not maintain per-thread state, which must be managed by the
programmer with global variables, Protothreads are useful because they address the problem of
source code organization inherent in event-driven systems.

In the network server domain, the TaskJava [6], Tame [9], and Capriccio [14] programming
language extensions provide the same functionality, with the additional feature of managing
state without the use of a stack. Just like Protothreads, they provide the programmability
of threading with the lower memory and time overhead of events.

The approach common to all of these systems is factoring blocking functions into a form
of continuation passing style (CPS) [6], where a value representing the current state of the
function can be passed back to that function as an argument at a later time to continue the
execution of that function. When it reaches the next blocking point, the function returns a
continuation that can later be used to restart its execution from the point it left off.
Combined with appropriate synchronization primitives, continuations can be used as a substitute
for cooperative threads, potentially at a much lower cost. In our second process layer
implementation, cps-proc, we apply this design style in the domain of computer architecture
simulation.
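As an illustration of this factoring, and not a rendition of cps-proc itself, the following C++ sketch shows one common way to make a function resumable: the resume point and any variables that must survive a blocking point are stored in an object, and a switch statement jumps back to the point where execution last stopped. The names CounterContinuation, resume, and out are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy process: emits 0, 1, ..., limit-1 and "blocks" after
// each value. The object itself plays the role of the continuation:
// resume_point records where to pick up, and the loop counter i is a
// member (not a local) so that it survives across blocking points.
struct CounterContinuation {
    int resume_point = 0;   // 0 = not started, 1 = blocked in loop, 2 = done
    int i = 0;
    int limit;
    std::vector<int>* out;  // records the visible side effects

    CounterContinuation(int limit_, std::vector<int>* out_)
        : limit(limit_), out(out_) {}

    // Runs until the next blocking point. Returns true if the process
    // blocked and expects to be resumed, false once it has finished.
    bool resume() {
        switch (resume_point) {
        case 0:
            for (i = 0; i < limit; ++i) {
                out->push_back(i);
                resume_point = 1;
                return true;   // blocking point: hand control back
        case 1:;               // the next resume() continues from here
            }
            resume_point = 2;
            break;
        default:
            break;
        }
        return false;
    }
};
```

Each call to resume() costs little more than a function call and a jump, which is the source of the low overhead claimed for this style; the price is that locals must be hoisted into the object by hand.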

4. Semantics. We will refer to the basic operation provided by the simulation kernel as
schedule. schedule takes three arguments: the number of simulation steps in the future
at which its target function (event handler) is to be called, the target function itself, and an
optional argument to be passed to the target function. A variant of schedule is the fundamental
operation in both SST and Manifold’s simulation kernels.
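To make the three-argument form of schedule concrete, the following is a minimal sketch of a toy event kernel in C++. It is not the SST or Manifold kernel; the class name ToyKernel, the use of std::function, and the tie-breaking rule for events scheduled for the same step are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy discrete event kernel (hypothetical, for illustration only).
class ToyKernel {
public:
    using Handler = std::function<void(void*)>;

    // Arrange for handler(arg) to be called `delay` steps from now.
    void schedule(std::uint64_t delay, Handler handler, void* arg = nullptr) {
        queue_.push(Event{now_ + delay, next_seq_++, std::move(handler), arg});
    }

    // Dispatch events in time order until none remain.
    void run() {
        while (!queue_.empty()) {
            Event e = queue_.top();
            queue_.pop();
            now_ = e.time;
            e.handler(e.arg);  // handlers may call schedule() again
        }
    }

    std::uint64_t now() const { return now_; }

private:
    struct Event {
        std::uint64_t time;
        std::uint64_t seq;  // assumed tie-breaker: earlier schedule() call first
        Handler handler;
        void* arg;
    };
    struct Later {
        bool operator()(const Event& a, const Event& b) const {
            return a.time != b.time ? a.time > b.time : a.seq > b.seq;
        }
    };
    std::priority_queue<Event, std::vector<Event>, Later> queue_;
    std::uint64_t now_ = 0;
    std::uint64_t next_seq_ = 0;
};
```

A handler that calls schedule from within run() is how a component in this toy kernel would keep itself alive, which mirrors the event-only style that the process layers are meant to complement.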

To schedule, our process layers add four process-oriented operations: spawn,
run_ready, cond_wait, and cond_set. Both implementations provide only these four
operations, which are sufficiently expressive for a wide range of hardware modelling tasks.

The spawn primitive creates a new process. Its arguments are the main function of the
process and an argument to be passed to it. spawn can be called from constructors, event
handlers, or within another process. When a process is created it immediately becomes runnable
and is added to the end of the ready queue.

The run_ready operation can be used from within event handlers and constructors but
not other processes. It runs all ready processes until they are finished or blocking. In
components, it is usually seen accompanying a spawn operation to ensure that the newly created
process will be run.

The only synchronization provided by these process implementations is a Boolean
condition variable. Processes only block by using the cond_wait operation on these variables,
which takes as arguments a condition variable and the desired value. Processes that have
called cond_wait will only resume when the condition variable has achieved the given value,
by having cond_set called on it. In our C++ implementations, the assignment operator is
conveniently overloaded to have the function of cond_set. This convention is followed by
the pseudocode examples as well. cond_wait can only be used from within a process, but
condition variables may be set from anywhere, allowing event handlers to trigger condition
variables and thereby allow processes to advance.
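The interplay of cond_wait and cond_set can be sketched as follows. This is a toy model rather than either of our implementations: a blocked process is represented only by a callback that resumes it, the class name ToyCondVar is hypothetical, and the choice that cond_wait proceeds immediately when the variable already holds the desired value is an assumption of the sketch.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy Boolean condition variable (hypothetical, for illustration only).
class ToyCondVar {
public:
    using Resume = std::function<void()>;

    explicit ToyCondVar(bool initial = false) : value_(initial) {}

    // cond_wait: proceed at once if the variable already has the desired
    // value (an assumption of this sketch), otherwise remember the waiter.
    // Returns true if the caller had to block.
    bool cond_wait(bool desired, Resume resume) {
        if (value_ == desired) {
            resume();
            return false;
        }
        waiters_.push_back({desired, std::move(resume)});
        return true;
    }

    // cond_set, spelled as assignment to mirror the convention in the text.
    ToyCondVar& operator=(bool v) {
        value_ = v;
        std::vector<Waiter> pending;
        pending.swap(waiters_);
        for (auto& w : pending) {
            if (w.desired == v) {
                w.resume();  // stand-in for "move process to ready queue"
            } else {
                waiters_.push_back(std::move(w));
            }
        }
        return *this;
    }

    bool value() const { return value_; }

private:
    struct Waiter {
        bool desired;
        Resume resume;
    };
    bool value_;
    std::vector<Waiter> waiters_;
};
```

Because assignment may be performed from any code that can see the variable, an event handler can release a waiting process simply by writing `ready = true;`.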

There  is  not,  as  there  is  in  VHDL  processes  [1]  and  others,  a  primitive  to  wait  for  a  given 
period  of  simulation  time.  However,  an  implementation  is  provided,  built  using  the  schedule 
and  cond_set  primitives.  A  pseudocode  rendition  of  this  is: 

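    cond_var ready;

    // Block the calling process for `delay` simulation steps.
    pause(delay) {
        ready = false;                   // cond_set via assignment
        schedule(delay, become_ready);   // event fires after `delay` steps
        cond_wait(ready, true);          // block until become_ready runs
    }

    // Event handler scheduled by pause.
    become_ready() {
        ready = true;
    }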

5.  Implementation. 

5.1.  ult-proc.  Our  first  implementation  of  a  process  layer  for  discrete  event  simulation 
frameworks  is  ult-proc.  In  its  original  implementation,  ult-proc  provided  a  minimal  wrapper 
for  the  getcontext  and  swapcontext  calls.  The  default  stack  size  is  set  to  two  kilobytes, 
sufficient  for  processes  without  many  large  local  variables  or  deep  call  nests.  An  optional 
third  parameter  to  spawn  allows  the  stack  size  to  be  adjusted  to  meet  the  demands  of  the 
application. 
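For readers unfamiliar with the context calls, the following sketch shows the bare mechanism that such a wrapper builds on: a function is given its own stack with getcontext and makecontext, and swapcontext transfers control between it and its caller. This is not the ult-proc source. It assumes a POSIX system that still provides <ucontext.h> (for example glibc on Linux), the names in the ult_sketch namespace are hypothetical, and the 64 KB stack is chosen for safety in the example rather than to match the two-kilobyte default described above.

```cpp
#include <cassert>
#include <cstddef>
#include <ucontext.h>
#include <vector>

namespace ult_sketch {

ucontext_t kernel_ctx;    // context of the code that resumes the process
ucontext_t process_ctx;   // context of the single toy process
std::vector<int> trace;   // order in which the numbered steps executed

// Body of the toy process: runs, blocks once, then finishes.
void process_main() {
    trace.push_back(1);
    swapcontext(&process_ctx, &kernel_ctx);  // block: save state, leave
    trace.push_back(3);
}   // returning from here resumes the context named by uc_link

// Create the process on its own stack and run it to completion.
std::vector<int> run_demo(std::size_t stack_bytes = 64 * 1024) {
    trace.clear();
    std::vector<char> stack(stack_bytes);

    getcontext(&process_ctx);
    process_ctx.uc_stack.ss_sp = stack.data();
    process_ctx.uc_stack.ss_size = stack.size();
    process_ctx.uc_link = &kernel_ctx;       // where to go on return
    makecontext(&process_ctx, process_main, 0);

    swapcontext(&kernel_ctx, &process_ctx);  // run until the process blocks
    trace.push_back(2);
    swapcontext(&kernel_ctx, &process_ctx);  // resume until it finishes
    trace.push_back(4);
    return trace;
}

}  // namespace ult_sketch
```

The whole stack must be allocated before the process first runs, whether or not it is ever used, which is the memory overhead attributed to user-level threads in this section.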

The advantage to using user-level threads over other means is simplicity, both of the
implementation and the API. While cps-proc requires all blocking functions to be wrapped
inside of C++ classes, with local variables represented by class member variables, ult-proc
can make any function that returns void and takes a single pointer argument into a process.
The disadvantages are poorer performance than cps-proc and that the stack
and process state represent binary regions which are not readily serialized. If local variables
could possibly be pointers, then serialization becomes altogether impossible. This makes
ult-proc a non-option for large-scale simulations using SST.

5.2. A Better Context Switch. The cost of using ult-proc is in the memory overhead
of unused stack space, the extra time needed to allocate it, and the time overhead of the
context switch. While the cost of a context switch in cps-proc is rather light, in ult-proc all
of the callee-saved CPU registers must be saved to some memory region and later restored.
When using swapcontext the situation becomes worse. Implementations of swapcontext
are required to preserve the signal mask of each of their threads, which should never be
touched by a simulator component. On x86 hardware running Linux, this requires a call to
sigprocmask, which incurs hundreds of cycles of overhead as observed in Section 6.

5.3. cps-proc. The second implementation uses a technique similar to Protothreads [5],
handling all of the complexities of resuming from continuations with the preprocessor. There
is still some design overhead; the component writer must remember to adhere to conventions
when creating processes. Other projects in the event-driven systems domain have dealt with
this problem using source-to-source translation (Tame [9]) or by adding new language features [6].
Instead, cps-proc imposes a set of constraints on the programmer in exchange for working
with the existing toolchain. The Poisson process traffic generator example from Section 2.2,
reimplemented in the style expected by cps-proc, would be:

    class process_main : public cps_proc {
        cps_delay pause;  // member, so it persists across blocking calls
        cps_proc *main() {
            CPS_PROC_BEGIN();
            pause.period = exp_rn(MTBP);
            CPS_PROC_CALL(pause);
            while (running) {
                generate_packet();
                pause.period = exp_rn(MTBP);
                CPS_PROC_CALL(pause);
            }
            CPS_PROC_END();
        }
    };

    constructor() {
        spawn_process(new process_main());
    }

The main function of the process must be named main and return a pointer to a cps_proc.
This function must be wrapped in a class that inherits from cps_proc. Any variables that
would normally be local variables of the main function that must persist over blocking calls
should instead be declared as member variables. main must begin with an instantiation of
the macro CPS_PROC_BEGIN and end with an instantiation of CPS_PROC_END. Blocking calls
must be performed on objects of the same type with CPS_PROC_CALL. These can be class
variables or allocated on the heap. Not seen in this example is the CPS_PROC_WAIT macro, which
implements the cond_wait primitive as described in Section 4.

6.  Performance. 


[Figure: two panels. Left: “Performance on Pingpong Benchmark”, x-axis “x1M Iterations
through Body of Process”. Right: “Performance on Server Benchmark”, x-axis “x1M Processes
Spawned”.]

Fig. 6.1. The ping-pong synthetic benchmark can be used to roughly estimate the time required for a wait
operation that blocks. Normalized averages for the time of a single iteration through the process loop of a single
ping-pong component are 1, 3.6, and 11.8 for cps-proc, ult-proc, and ult-proc-swapcontext respectively. The server
synthetic benchmark can be used to roughly estimate the time required for a spawn operation. Normalized estimates
for the time of a single spawn in the server benchmark are 15, 19, and 35 for cps-proc, ult-proc, and
ult-proc-swapcontext respectively.


6.1. Synthetic Benchmarks. Two synthetic benchmarks were devised to find rough
approximations of the amount of time required to block on a condition variable with cond_wait
and the amount of time required to spawn a new process. These benchmarks can be used to
provide a first-order approximation of the relative performance of actual simulation
components that spend significant portions of their time invoking the process system.

The first benchmark, built for the purpose of estimating the time required for a blocking
wait, is called ping-pong. It consists of two components, each running a single process,
constantly passing an event back and forth until the simulation has ended.

Figure 6.1 displays the execution time with respect to the number of loop iterations
allowed to complete. Each iteration contains two cond_wait operations that are guaranteed to
block. It can be seen that the curves are straight lines with origins near zero, and in fact their
execution time can be predicted using a linear approximation based on the computed iteration
time to within an RMS error of 5.6 ms. On the hardware used for the experiment, an Intel Xeon
workstation clocked at 3 GHz, half of the iteration time, a rough estimate of the time required
to perform a blocking cond_wait, is 76 ns for cps-proc and 272 ns for ult-proc. The benefit
of a context switch without system calls can be seen by noticing that this time
more than triples to 896 ns for ult-proc using swapcontext. Normalizing these values to 1,
3.6, and 11.8 for cps-proc, ult-proc, and ult-proc-swapcontext respectively gives a
platform-independent measure of the relative time required for a blocking cond_wait.

The  “baseline”  time  given  in  Figure  6.1  is  for  a  version  of  the  ping-pong  benchmark 
implemented  without  using  a  process  library.  Instead  of  blocking  twice  per  iteration  of  a 
loop,  the  baseline  time  is  simply  the  time  required  to  schedule  a  new  event  when  necessary, 
0.61  on  our  normalized  scale. 

The  second  synthetic  benchmark  is  designed  to  provide  an  estimate  of  the  time  required 
to  spawn  a  process.  It  is  a  server  that,  upon  the  receipt  of  a  request,  spawns  a  new  process 
to  handle  it.  This  process  delays  for  a  predetermined  period  and  then  schedules  an  event 
corresponding  to  returning  a  result. 

The results of ten runs of this benchmark are also in Figure 6.1. Again, assuming that
iteration times add linearly is a valid approximation, here to within an RMS error of 49 ms. On
our test system, the time for an iteration of the server benchmark less the half-iteration time
for the ping-pong benchmark, an approximation of the time required to spawn a process, has been
determined to be 1.12 µs, 1.46 µs, and 2.65 µs for cps-proc, ult-proc, and ult-proc-swapcontext
respectively. These values, normalized on the same scale as the ping-pong iterations,
are 15, 19, and 35. Spawning a new process has a higher cost than blocking
in an existing process.
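The normalized figures in this section follow from dividing each measured time by the 76 ns that a blocking cond_wait takes in cps-proc. The short C++ fragment below is only a worked check of that arithmetic; the function name normalized is hypothetical and is not part of either process library.

```cpp
#include <cassert>
#include <cmath>

// Express a measured time as a multiple of the cps-proc blocking
// cond_wait time (76 ns in the text), which defines 1.0 on the scale.
double normalized(double time_ns, double unit_ns = 76.0) {
    assert(unit_ns > 0.0);
    return time_ns / unit_ns;
}

// Blocking cond_wait:  272 / 76 ~  3.6   (ult-proc)
//                      896 / 76 ~ 11.8   (ult-proc-swapcontext)
// Spawn:              1120 / 76 ~ 15     (cps-proc, 1.12 us)
//                     1460 / 76 ~ 19     (ult-proc, 1.46 us)
//                     2650 / 76 ~ 35     (ult-proc-swapcontext, 2.65 us)
```

On this scale a spawn in cps-proc costs roughly as much as fifteen blocking waits, which is why components that spawn a process per request are dominated by spawn cost rather than by blocking.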

6.2. Profiling. The QSim front end provides a statistically sampled profiler that
produces a flat profile of dynamic instructions spent in each function in both the kernel and the
application. The advantages to this over a standard profiler like gprof are that it imposes no
overhead for instrumentation and provides an accurate picture of instructions executed within
the operating system as well as user mode. The disadvantages are that it offers no call graph
and without a timing model can only provide profiles in terms of dynamic instructions, not
time. The greatest deficiency in our experiments of the use of dynamic instruction counts is
understanding the impact of making frequent system calls. Dynamic instruction count is a
poor estimator of execution rate when two frequently executed instructions cause a
permission level switch costing hundreds of cycles.

To illustrate how much the cost of making system calls is underestimated by dynamic
instruction counts, the RMS error for a fixed-time-per-instruction estimation of the runtime
of the various forms of both benchmarks is 0.44 s. Omitting ult-proc-swapcontext, this error
drops to 0.17 s.

In  Figure  6.2,  we  see  that  most  of  the  dynamic  instructions  in  both  benchmarks  are  spent 
allocating  memory,  both  for  events  and  process  state.  The  profile  also  nicely  demonstrates 
the  advantage  of  relying  on  a  simple  user-level  context  switch  in  ult-proc  even  though  the  real 
cost  of  this  context  switch  is  much  greater.  Not  only  is  there  significantly  less  context  switch 
overhead,  but  the  cost  of  allocating  a  process  has  also  decreased  due  to  the  smaller  size  of  a 
process. 

[Figure: “Benchmark Dynamic Instruction Breakdown”.]

Fig. 6.2. Summary of dynamic instruction breakdown gathered with QSim profiler.

7. Conclusions. We have demonstrated a portable, serializable process implementation
that works as a layer separate from an underlying event-based simulation kernel, intended
for use in the implementation of hardware component models for parallel simulation
frameworks like SST and Manifold. In addition to this we have demonstrated a portable but
non-serializable process implementation that fills the same role and provides a more convenient
programming environment to the component writer. We have also shown that the system
context switch functions are inappropriate for implementations of processes in simulations and
replaced them with a more adequate solution. Use of the same context switch library within
QSim increased its performance in certain applications threefold.

8. Future Work. It is clear from Figure 6.2 that more work can be done to improve
the efficiency of memory allocation in simulations running on SST and Manifold. Even the
implementation of the ping-pong benchmark that used no process system at all had
comparable memory allocation overhead to the implementation using cps-proc. This is largely due
to the fact that every event created must be allocated on the heap by the component. The
optimization of allocators could have a significant impact on simulation performance.


RELIABILITY SIMULATION FOR STRUCTURAL SIMULATION TOOLKIT

AARON  S.  WILLIAMS*  AND  ARUN  F.  RODRIGUES1 

Abstract. Large scale reliability simulation is an important topic in high performance computing. It is important
to determine how long a system will be available to complete a calculation without a failure. These failures can
happen on multiple levels throughout the system, and the failure rates at these levels all differ, so it is important to
have a simulator where each component’s failure rate can be modeled independently. In this paper we discuss the
motivation behind reliability simulation, the Structural Simulation Toolkit on which the reliability simulation is
being based, and then give the basics of the various components needed to complete the simulator. We also discuss a
possible addition to the project as well as the desired output once it is complete.

1. Introduction. The goal of the ACS Reliability and Resilience project is to estimate
the failure rates of system components and do this using only monitoring data. Our approach
is to use the Structural Simulation Toolkit (SST) to generate realistic data in order to test the
new algorithms being developed. SST is being adapted to provide a simulation of jobs being
executed on a set of connected components with specified failure rates. Failure statistics are
collected from this simulation to be fed into the algorithms for testing purposes. In section 2
we describe the project tasks and the types of data that need to be generated. Section 3 provides
a brief description of SST. In section 4 we describe how the system must be described in XML
to properly create the connections of the graph of components being simulated. Section 5
describes the reliability simulation workload/job list. In section 6 we describe the components
which were added to SST for resilience simulation. Section 7 describes the output results
acquired from SST and section 8 summarizes the conclusions of this project.

2. Simulation Task. In order to develop this failure simulation capability in SST we
had to do several tasks. First, we had to design and create simulation components. There
are several types of components necessary: the leaf or compute nodes, the reliability nodes,
the scheduler node, and the infrequent component. The leaf or compute nodes are the nodes
to which jobs get assigned, and they take a certain scheduled time to complete these jobs.
The reliability nodes create the tree structure that connects the leaves. They can be pictured
as a rack in which four compute nodes are connected or a cabinet in which multiple racks
are connected. The scheduler node takes the set of jobs to be accomplished and maps them
across the leaf nodes and also takes care of bookkeeping for successes and failures of jobs.
The infrequent component is optional to this task; it is a component that is rarely needed
by the compute nodes but whose failure would still affect the completion of a task. Second,
we are working on a method to distill the records into an output usable by the team modifying
the algorithm which will use this test data. Lastly, we will need to continue to support
development of the simulator as the algorithm team finds bugs in the simulator code.

3. Overview of Structural Simulation Toolkit (SST). We are in the process of developing
the second generation of an event driven computer simulation tool called Structural
Simulation Toolkit [1]. This reliability simulation capability is being built as an addition to
the many other capabilities SST contains. At the beginning of a run of the simulator, an XML
file is read which defines how the user wants the system connected or “wired up”. These
connections are made using links on which events can be sent between components. These
events are developer defined. Event driven simulation means that everything that happens in
the simulator is triggered by an event on a link from either another component or the same
component trying to trigger an action internally. These events are received and handled by
an “event handler” which is defined by the developer. This event system is what we utilize to
simulate failures and to communicate down through the tree that a failure has occurred.

Fig. 4.1. Partial example of XML configuration for simulation

4. Establishing System to Simulate in XML. One requirement for the simulator was a
script that configures the system that is being tested. We refer to this phase as “Instantiating
the System”. A graphical example of a way to lay out these nodes is provided in figure
4.1. This script is given the number of leaf/compute nodes we intend to implement. There
is a specific tree structure required for the algorithms being implemented. The script takes
this number of leaf nodes and automatically generates an XML file of nodes including these
leaf nodes and the other nodes required to complete the tree. The script generates the links
between these nodes as required. In figure 4.2, a portion of a 6 compute node tree is given.
The two components defined there are the root node at the top of the tree and one just below
it. We observe that each component is given a unique id and is of type “resil”. This tells SST
which component class to use. Listed below that are the links that are connected to
the component, which port to connect them to, and the name of each edge. The
name of the edge must match between the two components you would like to connect on that
link, so that SST knows how to wire them correctly. Note also that a required latency
of 1 unit of time, in this case 1 nanosecond, is a parameter of each link because SST does
not allow an event to arrive at the exact same time it was sent unless it is sent on a self link to
the component. Each component also has a parameter “lambda”. This parameter is used along
with an exponential distribution to determine the random time at which this component will
fail.


<component id="1.1">
  <resil>
    <params><lambda>0.255809</lambda>
      <port0>1.1-2.1</port0> <port1>1.1-3.1</port1></params>
    <links>
      <link id="1.1-2.1">
        <params><lat>1NS</lat><name>link0</name></params>
      </link>
      <link id="1.1-3.1">
        <params><lat>1NS</lat><name>link1</name></params>
      </link>
    </links>
  </resil>
</component>
<component id="1.2">
  <resil>
    <params><lambda>0.255809</lambda><port0>1.2-2.1</port0>
      <port1>1.2-3.1</port1>
    </params>
    <links>
      <link id="1.2-2.1">
        <params> <lat>1NS</lat> <name>link0</name>
        </params>
      </link>
      <link id="1.2-3.1">
        <params> <lat>1NS</lat> <name>link1</name>
        </params>
      </link>
    </links>
  </resil>
</component>

Fig. 4.2. Partial example of XML configuration for simulation


5. Establishing the list of jobs to simulate. A script has also been written which
outputs a list of jobs to be accomplished by the given N leaf nodes. Each job in the list
is assigned a “Job ID”, a “size” or number of leaves necessary to compute, and
a “duration” or number of cycles to complete the task. An optional parameter on this list that
may be used later is a list of required components that this job must have to complete. This
list would be used by the scheduler when it is implemented.

6. Implemented Components. As discussed in the introduction, there are several types
of components needed to simulate the required failure modes.

6.1. Leaf/Reliability Components. In figure 4.1 we see that there are leaf nodes
described as well as reliability nodes. When first conceptualizing this project we thought that
these would have to be two different types of components in SST. We soon realized that their
behaviors are very similar and can be combined into the same component type. The incoming
link titles help these components determine if they are actually leaves at the bottom which
are being assigned jobs or if they are above that point in the graph. Leaf/Compute component
pseudo code can be found in figure 6.1.

    if (runningAJob()) {
        failed = checkForFailure();
        if (failed) {
            t = computeDowntime();
            tellScheduler(failed, t);
        } else if (jobDone()) {
            tellScheduler(completed);
        } else if (needInfrequentComponent()) {
            requestFromInfrequentComponent();
        }
    }

    while (e = GetNextEvent()) {
        switch (e) {
        case jobLaunch:
            startJob();
            informParentComponents();
            break;
        case failureNotice:    // from another component
            killJob();
            tellScheduler(failed);
            break;
        case jobKill:          // from scheduler
            killJob();
            break;
        case infrequentReply:  // an infrequent node gets back to us
            if (e.reply == BROKEN) {
                killJob();
                tellScheduler(failed);
            }
            break;
        }
    }

Fig. 6.1. Leaf/Compute node Pseudo code


Here we can see that, in the pseudo code, the component is polling to see if it has failed
while running a job. In the event driven simulation, however, the failure is actually sent as an
event to the same component in the future, and if that event arrives before the job is completed,
the job fails. Once a job fails we assume instant repair time for these components. This will
be modified in the future to be a variable length downtime. Once a job has completed, the
component tells the scheduler that it is free again and can be utilized. Also, the scheduler
can tell a component to stop mid job if the job a component is computing has failed because
of a failure elsewhere in the tree that didn’t directly affect the component itself. The same
component type is used for the reliability components in figure 4.1. Reliability component
pseudo code can be found in figure 6.2.

Here we can see that the functionality is very similar to that of the leaf/compute nodes
with the only difference being communicating with the scheduler to indicate that the
component is free. Also this component has an additional portion of code where it sends events
to the children below it in the tree saying that it has failed in the event of a failure arriving
on a self link. A failure on one of these nodes will cause a stop on all of the nodes directly
connected below this component all the way down to the set of leaf nodes connected on its
branch. This would cause the leaf node to notify the scheduler that a failure occurred and
thus its part of the job failed.

    if (state == BROKEN) {
        r = checkIfRepaired();
        if (r) {
            state = REPAIRED;
        }
    } else if (state == RUNNING) {
        failed = checkForFailure();
        if (failed) {
            t = computeDowntime();
            tellChildren(failed, t);
            state = BROKEN;
        } else {
            if (time() > runTill) {
                state = REPAIRED;
            }
        }
    }

    while (e = GetNextEvent()) {
        switch (e) {
        case jobLaunch:
            runTill = max(runTill, e.runTill);
            state = RUNNING;
            break;
        case failureNotice:
            tellChildren(failed, t);
            state = BROKEN;
            break;
        }
    }

Fig. 6.2. Reliability node Pseudo code


The failure distributions are modeled as an exponential distribution with each layer of the
tree having a different “lambda” value given. The equations associated with the exponential
distribution are widely known so we have excluded them. This lambda value determines when
the self link event is sent in the future to trigger a failure on one of the reliability or leaf
components.
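As a reminder of how a rate parameter turns into a failure time, the following C++ sketch draws an exponentially distributed delay and applies the rule described above, that a job fails if the failure event arrives before the job completes. It is a sketch under stated assumptions and not the SST component code: the function names are hypothetical, and the choice of std::mt19937_64 as the random source is ours.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Inverse-CDF form of the exponential distribution: if u is uniform on
// [0, 1), then -ln(1 - u) / lambda is exponential with mean 1 / lambda.
double failure_delay_from_uniform(double u, double lambda) {
    assert(lambda > 0.0 && u >= 0.0 && u < 1.0);
    return -std::log(1.0 - u) / lambda;
}

// Draw the delay until a component's next failure using the standard
// library's exponential distribution (same law as the formula above).
double sample_failure_delay(std::mt19937_64& rng, double lambda) {
    std::exponential_distribution<double> dist(lambda);
    return dist(rng);
}

// A job fails if the pending failure event arrives before the job ends.
bool job_fails(double job_start, double job_duration, double failure_time) {
    return failure_time < job_start + job_duration;
}
```

With the lambda of 0.255809 shown in figure 4.2, the mean delay between failures would be about 3.9 time units, since the mean of an exponential distribution is 1/lambda.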

6.2. Scheduler Component. The scheduler component is utilized to assign the jobs in the
job list to the compute nodes. It is also the component that keeps track of failures and
writes the output to a file to be used later on. This first implementation of the scheduler is
very basic. It has a list of the compute nodes of the system and which of those are currently
available. It implements a first come first served scheduling scheme. This has no look ahead
into the jobs list or concept of when jobs will complete on the compute nodes. If a job
requires 3 nodes to complete, the scheduler waits for three nodes to be free and then schedules
this job onto these nodes. If a failure causes the failure of a job, this job is moved to the
bottom of the queue and is completed at a later time. There is a future scheduler planned
which will implement more sophisticated scheduling tactics to optimize the throughput of the
system and minimize the time taken to finish all jobs on the list. Unlike other components,
the scheduler is assumed to never fail. The Scheduler and Bookkeeping Component pseudo
code can be found in figure 6.3.
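The scheduling policy just described can be sketched in a few lines of C++. This is an illustration of first come first served with requeue-on-failure, not the SST scheduler component: the names Job and FcfsScheduler are hypothetical, nodes are counted rather than tracked individually, and statistics recording is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Job {
    int id;
    std::size_t size;  // number of leaf nodes required
};

// Hypothetical first come first served scheduler sketch.
class FcfsScheduler {
public:
    explicit FcfsScheduler(std::size_t node_count) : free_nodes_(node_count) {}

    void submit(const Job& job) { queue_.push_back(job); }

    // Start jobs strictly in queue order. Stop at the first job that does
    // not fit, even if a later, smaller job would (no look ahead).
    std::vector<int> start_ready_jobs() {
        std::vector<int> started;
        while (!queue_.empty() && queue_.front().size <= free_nodes_) {
            free_nodes_ -= queue_.front().size;
            started.push_back(queue_.front().id);
            running_.push_back(queue_.front());
            queue_.pop_front();
        }
        return started;
    }

    void job_completed(int id) { release(id, false); }
    void job_failed(int id) { release(id, true); }  // goes to back of queue

    std::size_t free_nodes() const { return free_nodes_; }
    std::size_t queued() const { return queue_.size(); }

private:
    void release(int id, bool requeue) {
        for (std::size_t i = 0; i < running_.size(); ++i) {
            if (running_[i].id == id) {
                free_nodes_ += running_[i].size;
                if (requeue) queue_.push_back(running_[i]);
                running_.erase(running_.begin() +
                               static_cast<std::ptrdiff_t>(i));
                return;
            }
        }
    }

    std::size_t free_nodes_;
    std::deque<Job> queue_;
    std::vector<Job> running_;
};
```

One consequence of the lack of look ahead is visible in the sketch: a large job at the head of the queue holds back smaller jobs behind it even when enough nodes are free for them.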

6.3.  Infrequent  Component.  A  future  component  that  may  be  implemented  will  be  the 
concept  of  an  “infrequent  component”.  This  will  be  a  component  which  would  be  required 
occasionally  for  some  jobs.  It  would  be  modeled  after  components  such  as  a  filesystem  on  a 
high  performance  computer.  It  would  simply  have  an  available  or  unavailable  state  variable 

while (!done) {
    if (nodeTable.hasRoom()) {
        ScheduleNextJob();
    }
    while (e = GetNextEvent()) {
        switch (e) {
        case nodeFailure:
            killJob(e.job);              // kill the job on all nodes it is running
            recordStats(e);              // record statistics
            nodeTable.freeNodes(e.job);  // record the nodes as free
            break;
        case jobCompletion:
            recordStats(e);              // record statistics
            nodeTable.freeNodes(e.job);  // record the nodes as free
            break;
        }
    }
}


Fig.  6.3.  Scheduler  node  Pseudo  code 


and some repair time distribution associated with it, and jobs that required this type of
component would have to wait for it to be available in order to complete. We are not
certain this type of component will be required for simulations.

7. Simulation Outputs. The simulator is required to produce two outputs. The first is a
“job trace”, a file that lists every job. For each job it gives the job ID, followed by the
result (pass or fail), the start time, the end time, and the compute nodes that were used.
The second output is the “failure trace”, which lists every component in the system, both
compute and reliability nodes, together with the times at which it failed. These two files
will allow failure statistics to be compiled and used in the testing of the modified
algorithms for the ACS Reliability and Resilience project.
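
As a rough illustration of how statistics might be compiled from a job trace, the sketch below parses a small trace and counts failed runs. The file layout is an assumption: the text above names the fields but not their encoding, so the comma-separated rows, the semicolon-separated node list, the sample values, and the function name read_job_trace are all hypothetical.

```python
import csv
import io

# Hypothetical job trace: job ID, result, start time, end time, nodes used.
JOB_TRACE = """\
1,pass,0,40,0;1;2
2,fail,0,15,3
2,pass,40,70,3
"""


def read_job_trace(text):
    """Parse job-trace rows into dictionaries."""
    rows = []
    for job_id, result, start, end, nodes in csv.reader(io.StringIO(text)):
        rows.append({
            "job_id": int(job_id),
            "passed": result == "pass",
            "start": float(start),
            "end": float(end),
            "nodes": [int(n) for n in nodes.split(";")],
        })
    return rows


rows = read_job_trace(JOB_TRACE)
failed_runs = sum(1 for r in rows if not r["passed"])
# Node-time consumed by runs that ended in failure.
node_time_lost = sum(
    (r["end"] - r["start"]) * len(r["nodes"]) for r in rows if not r["passed"]
)
```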

8. Conclusions. We have provided some background for reliability simulation, along with
the basic components it requires and the inputs and outputs desired. We have also described
some of the basics of event-based computer system simulation and of SST. We described how
SST needs a system to be defined in XML in order to wire up components and transmit events
between them on links. The components described above complete a reliability simulation
system. They will help with the verification and testing of some modified algorithms within
the ACS Reliability and Resilience project here in the Computer Science Research Institute.
Visualization and Software Engineering

The  methods  and  architectures  in  the  previous  section  have  the  potential  to  generate 
massive  amounts  of  data  with  increasingly  complex  software  infrastructures.  The  articles  in 
this  section  are  dedicated  to  development  and  understanding  of  visualization  and  advanced 
software  engineering  techniques  to  tackle  these  challenging  issues  at  the  core  of  large-scale 
computing  and  computational  science. 

Silva and Shepherd describe the development of a package that couples the Titan
informatics toolkit with the VisTrails workflow management and information visualization
system. Using Titan’s informatics processing algorithms, this coupling enables, for example,
simplified workflow-based multiple-view comparative visualization. Harger and Crossno
compare a number of open source visual analytics toolkits. Using a feature-based comparison,
several applications were chosen for a more in-depth study using a benchmark document
‘explorer’ application. Nusbaum and Heroux present a new graphical user interface framework
for parameterized scientific applications. This software package replaces traditionally used
input decks, providing a more intuitive way for a user to specify application parameters.
Fermoyle and Heroux discuss the usage and functionality of the newly developed
Tpetra/Stratimikos linear algebra capability in Trilinos. The authors’ presentation provides
a step-by-step comparison of the new technology with the Epetra/AztecOO software stack,
emphasizing differences and the enhanced capabilities offered by Tpetra. Hardy and Clark
give an overview of the development of testing capability for the Python interface in Cubit.
The authors also discuss current and future metrics that can be used in testing.
INFORMATION VISUALIZATION USING VISTRAILS TECHNOLOGY

WENDEL  B.  SILVA*  AND  JASON  F.  SHEPHERD^ 

Abstract. The Titan project is an informatics toolkit that provides a flexible, component-based pipeline
architecture for ingestion, processing, and display of informatics data. It integrates capabilities from a series of
best-in-class open-source toolkits for scientific visualization, graph algorithms, linear algebra, and more. VisTrails
is an open-source scientific workflow and provenance management system that provides support for data exploration
and visualization. Moreover, a visualization pipeline created using VisTrails will be able to interact with other
software tools, including VisMashup and crowdLabs. Both software tools, Titan and VisTrails, represent the state of
the art in their research areas. In this paper we present the effort to develop a package allowing the Titan toolkit to
interact with VisTrails. Using the package developed, two projects within the Titan toolkit have been ported to
VisTrails, VisMashup and crowdLabs.


1. Introduction. There are now a number of established toolkits for scientific
visualization and data analysis (e.g., VisTrails, OpenDX, ParaView, SCIRun). Many of these
toolkits use workflow-based interfaces that expose computational components as modules and
represent a pipeline as a sequence of intuitive steps. Workflows allow the creation of
complex visualization pipelines that combine these modules, with the modules representing
computational components and the connections between them expressing the flow of data
through the pipeline. A workflow-based interface is useful for comparative visualization
and efficient exploration of parameter spaces. Although systems are being developed with
more sophisticated visual programming interfaces, the path from raw data to insightful
visualizations remains laborious and error-prone.

Another important feature in scientific data analysis and exploration is provenance.
Provenance helps users interpret and understand results. By examining the sequence of steps
that led to a result, we can gain insights into the chain of reasoning used in its production,
verify that the experiment was performed according to acceptable procedures, identify the
experiment’s inputs, and reproduce the result.

This paper describes the creation of the Titan package for VisTrails, which allows
informatics pipelines to be constructed and executed in a workflow-based visualization
system with provenance. In addition to the algorithms available in Titan, Titan coupled with
VisTrails also provides a user interface that simplifies multiple-view comparative
visualization, efficient execution of complex visualization pipelines, an infrastructure for
provenance, an interface to VTK algorithms[12] in the pipeline, support for Python scripting,
and a framework to easily import Python-based libraries. Using the Titan package in
VisTrails, we ported two document analysis projects associated with the Titan project
(namely, ParaText[9] and Prototype 2, ‘P2’). These new VisTrails pipelines could then be
executed with VisMashup[15] and crowdLabs[1].

In section 2 we present the Titan toolkit as well as the two projects ported in this paper.
In section 3 we describe the creation of the Titan package for VisTrails, and in section 4
we detail the process of building the ParaText and Prototype 2 pipelines using the Titan
package. Section 5 describes the VisMashup and crowdLabs software and shows the projects
running on them. Finally, we present a conclusion and outline directions for future work.

2. Titan Toolkit. The Titan Informatics Toolkit[7] is a collaborative effort led by
Sandia National Laboratories. It represents a significant expansion of the Visualization
ToolKit (VTK) to support the ingestion, processing, and display of informatics data. The
Titan project represents one of the first software development efforts to address the
merging of scientific visualization and information visualization on a substantive level.


Fig.  2.1.  A  document  analysis  application  using  Titan  Informatics  Toolkit  is  shown. 


By leveraging the VTK engine, Titan provides a flexible, component-based pipeline
architecture for the integration and deployment of algorithms in the fields of intelligence,
semantic graph and information analysis. The VTK parallel client-server layer will provide
a framework for scalable analysis on distributed memory platforms. Combining the two
fields is already paying off in the form of new functionality.

Titan integrates capabilities from a series of best-in-class open-source toolkits,
including scientific visualization (VTK[12]), graph algorithms (Boost Graph Library[11]),
linear algebra (Trilinos[3]), named entity recognition (StanfordNER[10]), and others.
Components may be used by application developers through their native C++ API on all popular
platforms, or through any of a broad set of language bindings that include Python, Java, and
TCL. As shown in Figures 2.1 and 2.2, applications are built by combining Titan components
to address problems in a specific domain.

To simplify the use of the toolkit, Titan has a set of readers, database connectors,
components, views and interfaces that allow end-user applications to be constructed from the
toolkit components. The readers allow the user to easily import data and transfer it into
one of the Titan data structures. Once the data is in one of these data structures, the user
can apply any of the many Titan algorithms to process, transform, modify or visualize it.

In the same way that scientific visualization applications can be built with VTK, Titan
can be used to build information visualization and analysis applications. In the following
subsections we describe two projects currently present in the Titan toolkit: Prototype 2
and ParaText.

2.1. Prototype 2. Prototype 2 (P2) is a document analysis tool built from Titan
algorithms. It combines similar-document clustering, entity extraction, and document-entity
relationship visualization with powerful graph search algorithms in a stand-alone
application. It organizes and classifies documents based on their semantic content and
performs named-entity (people, location, organization, misc) extraction to allow the
exploration of the relationships between entities and documents. P2 was developed to include
document clustering using latent semantic analysis (LSA) and named-entity extraction to
allow enhanced searching based on content (e.g., particular locations or people) within a
corpus of documents. P2 facilitates the investigation of free text data by identifying the
relationships among the selected documents, their similarities, and their entities. In its
current version, P2 supports the following file formats: MS Word, PDF and plain text.


Fig. 2.2. Main window of the project Prototype 2. The left figure shows the application after analyzing a set of
documents. The right figure shows the same screen after some user interactions.


Figure 2.2 shows the main window of the application after the documents under analysis
have been processed. The view in the upper left of Figure 2.2 (the cluster list view) shows
the analyzed documents grouped into conceptual clusters, so that documents on the same
topic, and therefore with similar content, fall in the same cluster. The cluster label lists
the meaningful terms that occur in the documents within the cluster.

A similarity graph in a tree-ring view is shown in the bottom-left corner of the
application window (Figure 2.2). It visually reflects the clusters shown in the cluster list
view and shows the similarity links between the documents. High similarity links are
represented in red, whereas links between clusters are blue and green.

The middle of the window in Figure 2.2A shows the text extracted from the various
document formats accepted by the application. The text is displayed with the extracted
entities highlighted and hyperlinked. The highlighted words are color coded by entity type
(person, organization, location, misc). The hyperlink attached to a word opens the
Wikipedia page containing information about that term (Figure 2.2B).

Users are able to select specific entities of interest by dragging them into the hotlist,
which appears just below the list of named entities (located in the middle right of the
window in Figure 2.2). The application then computes the paths between entities using a
connection subgraph algorithm. The algorithm attempts to find short, direct paths between
nodes in a graph by penalizing long paths as well as paths which pass through high-degree
nodes. The resulting graph can be seen in the lower-right part of Figure 2.2B.
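
To make the two penalties concrete, the following sketch finds a cheapest path in which every hop has a base cost and entering a high-degree node costs extra. This is not the connection subgraph algorithm used by P2, whose details are not given here; it is a plain Dijkstra search with an assumed cost of 1 + w log(degree) per hop, and the function name penalized_path and the toy graph are hypothetical.

```python
import heapq
import math


def penalized_path(graph, source, target, degree_weight=1.0):
    """Dijkstra search where entering a node costs 1 + w * log(degree).

    Each hop costs at least 1, so long paths are penalized, and hops
    into high-degree nodes cost more than hops into quiet ones.
    """
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for neighbor in graph[node]:
            step = 1.0 + degree_weight * math.log(len(graph[neighbor]))
            new_cost = cost + step
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Entities A and B are linked through a busy hub and through a quiet chain.
graph = {"A": ["hub", "x"], "B": ["hub", "y"], "x": ["A", "y"], "y": ["x", "B"]}
leaves = [f"leaf{i}" for i in range(18)]
graph["hub"] = ["A", "B"] + leaves
for leaf in leaves:
    graph[leaf] = ["hub"]

quiet = penalized_path(graph, "A", "B")
shortest = penalized_path(graph, "A", "B", degree_weight=0.0)
```

With the degree penalty switched on, the three-hop chain through x and y is preferred over the two-hop route through the hub; with the penalty set to zero the search falls back to the shortest path.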

2.2. ParaText. The ParaText system is an end-to-end process for scalable distributed
memory analysis of large document collections. It is composed of a set of text analysis
components designed to function within a Titan data processing pipeline, where data sources,
filters, and sinks can be combined in arbitrary ways. ParaText presents a full LSA process
from text ingestion and modeling data to analysis tasks including information retrieval and
document similarity.

ParaText components are chained together into data-parallel pipelines that are replicated
across processes on distributed-memory architectures. Individual components can be replaced
or rewired to explore different computational strategies and implement new functionality.
The retrieval method employed is latent semantic analysis (LSA)[14] using singular value
decomposition (SVD)[8].

ParaText  has  been  used  in  applications  spanning  a  broad  set  of  domains.  Figure  2.3 
shows  the  visualization  of  some  data  generated  using  ParaText. 


Fig. 2.3. Visualization of data generated using ParaText and visualized using ParaView. The left image shows
the result of a bibliometric analysis and the right image a funding portfolio analysis.

The first part of the pipeline consists of filters for extracting and transforming text. With
the exception of determining which files should be processed on which processors, the filters
described in this section all parallelize extremely well[9]. Once each processor determines
the list of terms in its local data (i.e., documents), the Term Dictionary filter creates a
global dictionary with each term listed exactly once. Given the local term lists and the
global term dictionary, each processor uses the Term Document Matrix filter to create its
portion of a sparse, distributed term-document frequency matrix. Following creation of the
frequency matrix, the matrix must be weighted to incorporate the importance of the terms
throughout the collection.
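
The dictionary and matrix construction described above can be illustrated on a single process with a short Python sketch. It does not use the Titan filters; the sample documents, the whitespace tokenization, and the dictionary-of-keys storage are assumptions made only for the example.

```python
from collections import Counter

documents = [
    "titan wraps vtk",
    "vistrails wraps titan",
    "vtk renders graphs",
]

# Per-document term counts (what each processor would compute locally).
local_counts = [Counter(doc.split()) for doc in documents]

# Global dictionary: every term listed exactly once, in a fixed order.
dictionary = sorted(set().union(*local_counts))
term_index = {term: row for row, term in enumerate(dictionary)}

# Sparse term-document frequency matrix stored as {(row, column): count}.
frequency = {
    (term_index[term], column): count
    for column, counts in enumerate(local_counts)
    for term, count in counts.items()
}
```

Only nonzero entries are stored, so the size of the matrix grows with the number of distinct term occurrences rather than with the product of dictionary size and document count.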

Once  the  local  and  global  term  weights  are  computed,  the  Scale  Dimension  filter  then 
applies  these  weights  to  the  matrix.  To  compute  the  SVD  of  the  weighted  term-document 
matrix,  ParaText  wraps  the  distributed  block  Krylov-Schur  method  from  the  Anasazi  package 
of  the  Trilinos  solver  library.  In  ParaText,  document  similarities  are  computed  as  the  cosine 
values  between  scaled  LSA  document  vectors. 
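
As a small worked example of the similarity measure, the sketch below computes the cosine between document vectors. The rank-2 vectors are invented for illustration and are assumed to be already scaled; the function name cosine_similarity and the choice to return zero for an all-zero vector are likewise assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


# Hypothetical scaled LSA document vectors of rank 2.
doc_vectors = {"doc_a": [1.0, 2.0], "doc_b": [2.0, 4.0], "doc_c": [2.0, -1.0]}

same_direction = cosine_similarity(doc_vectors["doc_a"], doc_vectors["doc_b"])
orthogonal = cosine_similarity(doc_vectors["doc_a"], doc_vectors["doc_c"])
```

Because the cosine depends only on direction, doc_a and doc_b are treated as maximally similar even though their lengths differ.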

3. Titan Package for VisTrails. VisTrails[6] is an open-source system designed to support
exploratory computational tasks, including visualization and data mining. VisTrails combines
features of workflow and visualization systems, and provides a comprehensive provenance
management infrastructure. The availability of provenance information enables a series of
operations that simplify exploratory processes and foster reflective reasoning. For example,
scientists can easily navigate through the space of workflows created for a given exploration
task, visually compare workflows and their results, and explore large parameter spaces.

VisTrails provides a plugin infrastructure to integrate user-defined functions and libraries.
Specifically, users can incorporate their own visualization and simulation codes into pipelines
by defining custom modules (or wrappers). These modules are bundled into packages. A
VisTrails package is simply a collection of Python classes where each of these classes
represents a new module.

VisTrails presents an easy way to create a user-defined module, as can be seen in
Algorithm 1. New VisTrails modules must subclass Module, the base class that defines basic
functionality. The only required override is the compute() method, which performs the actual
module computation. Input and output are specified through ports, which currently have to be
explicitly registered with VisTrails.

class Div(Module):
    def compute(self):
        arg1 = self.getInputFromPort("arg1")
        arg2 = self.getInputFromPort("arg2")
        if arg2 == 0.0:
            raise ModuleError(self, "Division by zero")
        self.setResult("result", arg1 / arg2)

registry.addModule(Div)
registry.addInputPort(Div, "arg1", (basic.Float, 'x'))
registry.addInputPort(Div, "arg2", (basic.Float, 'y'))
registry.addOutputPort(Div, "result", (basic.Float, 'x/y'))

Algorithm  1.  Example  of  a  very  simple  user-defined  module 


Although the Titan toolkit provides Python bindings, the Titan wrapping doesn’t adopt
the class format expected by VisTrails. To use the Titan classes in VisTrails, a Python script
was developed (i.e., the Titan package) that accesses and registers Titan classes as VisTrails
modules. This package loads all Titan classes and translates them into a format recognizable
to VisTrails. The script searches all Python-wrapped Titan classes and generates a
corresponding module for each.

For optimization purposes, the Python version of Titan doesn’t define or allocate its own
submodules. Because of this, all Titan submodules had to be listed explicitly within the
VisTrails package so that they are loaded and made accessible within VisTrails.

During package loading, the script executes the following steps. First, all Titan modules
are loaded and module properties are extracted from each submodule. This process is
executed recursively, ignoring all unnecessary information, and generates a tree with all
classes present in the Python wrap.


Fig. 3.1. Titan and VTK package working together on VisTrails.


Following the package load, the Titan class tree is traversed and all modules are
registered into the VisTrails module registry. For each module registered, the script
identifies the get/set methods of the class and adds them as the module’s inputs and outputs.
During this process, the script also identifies the class type of each input/output. At this
step, the package checks whether the class is also present in VTK. If so, the package changes
the input/output type to use the VTK version of the class. This step is necessary to allow
Titan and VTK modules to connect and work together.

In addition to the get and set methods of the class, other methods may also be included
in the module as inputs/outputs, and some methods may be designated as disallowed methods.
The package also keeps a list of disallowed classes and modules. Since VisTrails doesn’t
allow input/output overloading, the package searches for overloaded methods and changes the
names appropriately.
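
The port discovery and renaming steps can be sketched in plain Python. Nothing below is taken from the Titan package itself: FakeTitanFilter stands in for a Python-wrapped Titan class, discover_ports and unique_port_names are hypothetical helpers, and the numeric-suffix renaming scheme is an assumption, since the text does not say how overloaded names are changed.

```python
import inspect


class FakeTitanFilter:
    """Stand-in for a Python-wrapped Titan class (hypothetical)."""

    def SetFileName(self, name): ...
    def GetFileName(self): ...
    def SetThreshold(self, value): ...
    def Update(self): ...


def discover_ports(cls, disallowed=()):
    """Map Set*/Get* methods to input and output port names."""
    inputs, outputs = [], []
    for name, _ in inspect.getmembers(cls, inspect.isfunction):
        if name in disallowed:
            continue
        if name.startswith("Set"):
            inputs.append(name)
        elif name.startswith("Get"):
            outputs.append(name)
    return inputs, outputs


def unique_port_names(signatures):
    """Give each overload of a method its own port name.

    `signatures` is a list of (method_name, argument_types) pairs; an
    overloaded method appears more than once. Repeats get a numeric suffix.
    """
    seen = {}
    names = []
    for method, _arg_types in signatures:
        seen[method] = seen.get(method, 0) + 1
        count = seen[method]
        names.append(method if count == 1 else f"{method}_{count}")
    return names


inputs, outputs = discover_ports(FakeTitanFilter)
ports = unique_port_names([
    ("SetInput", ("vtkGraph",)),
    ("SetInput", ("vtkTable",)),
    ("SetName", ("str",)),
])
```

Methods such as Update that are neither getters nor setters are skipped here; a fuller version would consult a list of additional allowed methods, as the package is described as doing.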

After the package is loaded, both the VTK and Titan packages can work together in the
same pipeline (Figure 3.1). As can be seen in the middle part of Figure 3.1, the package
identifies common modules and supports their connection and interaction.

Although the Titan package for VisTrails was constructed to allow connectivity between
Titan and VTK, the approach used doesn’t support polymorphism. The Titan package does not
check a module’s superclass, requiring the user to do the downcasting within a PythonSource
module when the superclass of a module is needed (Figure 3.2). As shown in Figure 3.2, such
a situation arose during the port of P2 to VisTrails: the Titan module outputs a
vtkPDFTextExtractionStrategy, while the VTK module only accepts a
vtkTextExtractionStrategy.

[Figure: class diagram showing vtk.vtkTextExtractionStrategy, titan.vtkTextExtractionStrategy,
titan.vtkIIRTextExtractionStrategy, titan.vtkMSWordTextExtractionStrategy, and
vtkPDFTextExtractionStrategy.]

Fig. 3.2. This image shows the polymorphism problem between the Titan and VTK packages.



Finally, the Titan package for VisTrails has a dependence on the Spreadsheet and VTK
packages. Furthermore, since VisTrails only supports modules implemented in Python, it is
necessary to enable the creation of the Python wrappers when compiling Titan.

4. Porting to VisTrails. A novice user (i.e., an experienced programmer who is
unfamiliar with the modules and the dataflow paradigm), or even an advanced user performing
a new task, often resorts to manually searching for existing pipelines to use as examples.
These examples are then adapted and iteratively refined until a solution is found.
Unfortunately, this manual, time-consuming process is the current standard for creating
visualizations rather than the exception[13]. Although we had access to the P2 pipeline[2]
as well as the ParaText and P2 source code to use as reference, the porting was not
straightforward. In the next subsections we give insight into the porting process as well as
its results.

4.1. P2 running on VisTrails. P2 was originally developed as a stand-alone application
with a Qt interface. As such, P2 uses several different Qt widgets in the GUI front-end
(e.g., QListView, QTreeView, QTextView). Since VisTrails does not have modules that wrap the
Qt widgets needed to display some of the P2 results, the P2 pipeline was adapted to use the
modules and features available. VisTrails provides different ways to visualize data (e.g.,
RichTextCell to display HTML and plain text, VTKCell, MplFigureCell, and OpenGLCell[5]).
Furthermore, it is possible to build your own view cell, with the restriction that it must
inherit from a Qt widget.

Although VisTrails provides enough support to allow users to build their own
visualization modules, the P2 pipeline running on VisTrails instead uses PyQt inside the
PythonSource module. For visualization the ported pipeline also uses the modules
RichTextCell and VTKCell. For example, the pipeline uses RichTextCell to visualize the
documents analyzed and VTKCell for the ring graph. Since RichTextCell allows the
visualization of HTML files, the pipeline is able
Fig. 4.1. Prototype 2 running on VisTrails using Titan.


from PyQt4 import QtGui
self.TreeView = QtGui.QTreeView()

# Set the QTreeView parameters
# Fill the QTreeView with the information

self.TreeView.setModel(QtGui.QStandardItemModel())
self.TreeView.show()

Algorithm  2.  Example  of  a  python  script  used  in  the  module  PythonSource  to  use  Qt  Widgets 


to display some text documents in it, as well as highlight and hyperlink the entities inside
the document. For these views, a script in the pipeline accesses the analyzed document and
its entities to generate the HTML (the middle spreadsheet cell in Figure 4.1).

Figure 4.1 also shows the use of VTKCell and PyQt. As shown in the middle cell of the
spreadsheet, the text has been highlighted with four different colors: green, red, magenta
and blue, representing the entity types person, organization, location, and misc,
respectively. The right cell shows the tree-ring view discussed in section 2.1. The left cell
shows the same information as the tree-ring view but in a different graph format.

To  visualize  a  tree,  Qt  widgets  are  used.  Using  the  VisTrails  module  PythonSource  it 
is  possible  to  write  and  execute  a  python  script  in  the  pipeline.  Moreover,  this  script  has 
access  to  other  python  modules  allowing  the  use  of  the  package  PyQt.  Since  PythonSource 
is  a  base  VisTrails  module,  it  can  communicate  with  any  other  VisTrails  modules,  even  third 
party  packages. 

Algorithm 2 shows an example PythonSource script that creates a QTreeView. To use PyQt
in the PythonSource it is not necessary to create Qt instances such as QApplication and
QMainWindow. An important detail is to define the widget instance on self in order to widen
its scope. With application scope, the widget will be active until the user closes it or the
VisTrails application is closed. If the variable has only module scope, the widget will
disappear as soon as the module finishes its execution.
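
The effect of keeping the widget on a longer-lived object can be imitated without Qt. In the sketch below, Widget and Holder are hypothetical stand-ins for a Qt widget and for the object bound to self, and a weak reference is used only to observe whether the widget is still alive; the immediate reclamation shown is the behavior of CPython's reference counting.

```python
import gc
import weakref


class Widget:
    """Stand-in for a Qt widget (hypothetical; no PyQt needed)."""


class Holder:
    """Stand-in for the long-lived object bound to `self`."""


def run_script(holder, keep):
    widget = Widget()
    if keep:
        holder.widget = widget   # like `self.TreeView = ...`
    return weakref.ref(widget)   # lets us check whether it survived


holder = Holder()
kept = run_script(holder, keep=True)
dropped = run_script(Holder(), keep=False)
gc.collect()

widget_still_alive = kept() is holder.widget
local_widget_gone = dropped() is None
```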

4.2. ParaText running on VisTrails. Although ParaText was originally designed for
execution in a distributed processing environment, the ParaText port to VisTrails is currently
executed only in a single-processor environment. Files may be sent to the pipeline in two
ways: by searching for files in a specified directory, or by selecting files from the file
system or the internet. In the second case, if files are specified from the internet, the
pipeline will download and use them. Like the P2 port for VisTrails, ParaText also uses the
combination of PythonSource with RichTextCell to display results.


Fig.  4.2.  ParaText  being  executed  as  a  VisTrails  workflow. 


As shown in Figure 4.2, the ParaText pipeline generates an HTML table and displays it in
the spreadsheet. The following information is displayed as the result of the pipeline
execution: dictionary token count, unique token count, frequency matrix, SVD best rank, left
and right singular vectors, and singular values.

4.3. Other comments. PythonSource was one of the most used modules during the
creation of the P2 pipeline. Using this module the user has the freedom to add almost any
Python script to the pipeline. This module was used heavily during the creation of the
pipeline for debugging purposes. In addition, it was used to perform downcasts, to generate
PyQt visualizations, and to generate the HTML, among other uses.

Another important use of PythonSource was to update a module. When the same module
is used in different parts of the pipeline, Python may, for memory management reasons,
remove the module instance from memory. In this case, the module has to be ‘updated’
before each usage.

Some of the Titan modules accept multiple connections to an input port. For
compatibility reasons VisTrails allows this, but the method is hidden. During the
construction of the pipelines, some modules used this feature.

5. VisTrails Technology. Some applications accept a VisTrails file as input (e.g.,
VisMashup and crowdLabs). To take advantage of these tools, the next step after porting the
projects to VisTrails was to see whether we could improve the projects using those
applications. In the next subsections we present these applications.

5.1.  VisMashup.  VisMashup  is  a  tool  to  simplify  the  creation,  maintenance,  and  use 
of  customized  visualization  applications.  A  VisMashup  can  be  combined  with  any  workflow 
system  as  long  as  the  workflow  system  provides  access  to  workflow  specifications,  the  ability 
to  identify  and  change  workflow  components,  and  to  execute  the  workflows.  The  VisMashup 
application  is  flexible,  with  a  GUI  automatically  generated  from  a  medley  specification. 

Instead of interacting directly with a dataflow network or a very general and complex
GUI, users manipulate and execute a set of pipelines in a medley through a small number of
graphical widgets. Application maintenance is simplified, since the GUI can be automatically
updated when the underlying medley changes. Furthermore, besides setting parameter values
and synchronizing parameters, users can customize a mashup: they can hide, show, and
move widgets around. When the user changes the pipeline views through the mashup, the
changes are propagated down to the pipeline level and sent to the visualization system for
execution.

Fig. 5.1. ParaText running on VisMashup.

The  port  of  a  VisTrails  pipeline  to  a  VisMashup  is  straightforward [4].  Figure  5.1  shows 
ParaText  running  on  VisMashup.  With  the  mashup  the  user  does  not  need  to  know  the 
pipeline,  or  need  to  change  the  parameters  on  specific  modules  within  the  pipeline.  As  shown 
in  Figure  5.1,  the  user  can  interact  with  the  pipeline  parameters  using  only  mashup  widgets 
and  immediately  see  the  results. 

5.2. crowdLabs. CrowdLabs is a social visualization repository for scientific workflow
management systems. It adopts the model used by social Web sites and integrates a set of
usable tools and a scalable infrastructure to provide an environment in which scientists can
collaboratively analyze and visualize data. This kind of website may lower the barriers to
using scientific analyses and enable broader audiences to contribute insights to the scientific
exploration process, without the high costs incurred by traditional portals. In addition, it
supports a more dynamic environment where new exploratory analyses can be added on the fly.
CrowdLabs enables sharing of both VisTrails pipelines and VisMashups.

CrowdLabs  provides  the  most  advanced  integration  of  provenance,  workflows,  visual¬ 
izations,  and  social  collaboration  features  to  date.  The  system  goes  beyond  Web  access,  and 
provides  a  powerful  API  that  can  be  used  for  developing  collaborative  analysis  applications. 
CrowdLabs  enables  the  creation  of  a  platform  that  simplifies  the  process  of  integrating  publi¬ 
cations  with  computations,  visualizations,  workflows,  datasets  and  parameter  values  used  to 
create  them. 

CrowdLabs is open to the community, allowing anyone to create an account and have full
access to the website features. A VisTrails file is uploaded using the upload section of the
VisTrails tab. Anyone with the right permissions may access the uploaded file and its
information, which includes version history, workflows, modules, etc. In addition, an embedded
version of the workflow can be shared in other formats such as wikis, blogs and LaTeX. To
create and upload a VisMashup on crowdLabs, the VisTrails file first has to be uploaded
to the website. After that, the VisMashup application can be used to select the VisTrails
file that contains the workflow from which the mashup will be created. It is important to
stress that the workflow needs a tag (a name) for the mashup to work. Finally, when the
mashup is created, there is an option to automatically upload it to crowdLabs. Like the
workflow, the mashup can be embedded in a webpage.

6. Conclusions and Future Work. In this paper we presented and described the
development of the Titan package for VisTrails. In addition, two Titan projects were ported
to VisTrails, as well as to VisMashup and crowdLabs. As a result of the porting to VisTrails,
both projects gained features including a workflow-based interface, visualization, and
provenance. Although some modifications to the pipelines were necessary, the original and the
ported pipelines have the same core functionality. For both projects a VisMashup was created
that allowed parameter changes and updates on the pipeline in a simplified interface. Finally,
both VisTrails and VisMashup were attached to the web visualization repository crowdLabs.

In the future, ParaText could be implemented on VisTrails to use MPI to enable more
efficient runs on larger datasets. Furthermore, both projects could be visualized on mobile
devices (e.g., iPod, iPhone, iPad) using VisMashup.

7. Acknowledgements. The authors would like to thank Emanuele Santos of the University
of Utah for her help with VisTrails technology.

COMPARISON OF OPEN SOURCE
VISUAL ANALYTICS TOOLKITS

JOHN R. HARGER* AND PATRICIA J. CROSSNO†

Abstract.  We  present  the  results  of  a  two-stage  evaluation  of  open  source  visual  analytics  packages.  The  first 
stage  is  a  broad  feature  comparison  over  a  range  of  open  source  toolkits.  Although  we  had  originally  intended  to 
restrict  ourselves  to  comparing  visual  analytics  toolkits,  we  quickly  found  that  very  few  were  available.  So  we 
expanded  our  study  to  include  information  visualization,  graph  analysis,  and  statistical  packages.  We  examine  three 
aspects  of  each  toolkit:  visualization  functions,  analysis  capabilities,  and  development  environments.  With  respect 
to  development  environments,  we  look  at  platforms,  language  bindings,  multi-threading/parallelism,  user  interface 
frameworks,  ease  of  installation,  documentation,  and  whether  the  package  is  still  being  actively  developed.  The 
second  stage  of  our  evaluation  delves  more  deeply  into  a  subset  of  the  toolkits  and  reports  our  experiences  using 
them  to  implement  a  simple  visual  analytics  application.  Based  on  the  number  and  variety  of  visualization  and 
analysis  components  found  within  a  package  during  the  first  round,  we  selected  Prefuse  Flare,  Tulip,  and  Titan  for 
the  second  round.  Our  benchmarking  application,  which  was  picked  based  on  similar  functionality  in  our  second 
round  toolkits,  is  a  document  ’explorer’  application  that  allows  users  to  understand  and  explore  a  corpus  of  document 
relationships. 


1.  Introduction.  With  the  growth  of  Visual  Analytics,  more  applications  are  being  writ¬ 
ten  that  tightly  couple  analysis  and  visualization.  Developers  of  these  applications  have  the 
choice  of  leveraging  third-party  toolkits  or  implementing  desired  functionality  from  scratch. 
There  are  many  advantages  in  working  with  toolkits,  yet  finding  a  toolkit  that  combines  the 
desired  characteristics  for  analysis,  visualization,  and  development  platform  may  not  be  pos¬ 
sible.  Open  source  toolkits  provide  a  solution  in  which  packages  that  come  close  to  developer 
needs  can  be  extended  or  modified.  Although  individual  toolkits  have  been  reviewed  at  web¬ 
sites  such  as  infosthetics.com,  no  single  reference  provides  developers  with  the  comparative 
information  they  need.  However,  the  motivation  for  our  evaluation  stems  from  the  need  to 
understand  how  our  toolkit  compares  to  other  similar  packages. 

Sandia National Laboratories, in collaboration with Kitware, Inc., has developed the
Titan Informatics Toolkit. Titan is an open source project built upon the scalable architecture
of the Visualization Toolkit (VTK) [28, 21].

2. Prior Work. Laramee [22] provided a practical comparison aimed at helping developers
select the best graphics or visualization library for building their applications through
the evaluation of feature sets, software source and development environment, and the ease of
implementing a benchmark application. However, the evaluation was limited to an in-depth
examination of four pre-selected packages: VTK, Open Inventor, Coin3D, and Hoops 3D.
Although we took inspiration from Laramee’s approach, we wanted to provide greater breadth
by first doing a feature evaluation stage that looked at a wide-ranging set of open source
visualization and analysis libraries as a basis for selecting the toolkits used in the
implementation stage. Another significant difference is that our evaluation focuses on open
source visual analytics toolkits, so we are looking for packages with a combination of
visualization and analysis functionality.

3.  Functional  Comparison.  Stage  one  involved  comparing  features  of  twenty-four  toolk¬ 
its  and  applications  to  narrow  them  down  to  a  set  for  deeper  analysis.  The  sources  for  finding 
these  toolkits  were  the  Infovis  Wiki  [16]  and  Internet  searches. 

This section describes those areas of functionality that we compared across the toolkits
we looked at for the first stage of the evaluation. These include visualization functionality,
analysis functionality and a look at the development environment used by the toolkit.

Fig. 3.1. Comparison of Visualization Functionality (Graphs)

*University of New Mexico, jharger@unm.edu
†Sandia National Laboratories, pjcross@sandia.gov

3.1. Visualization Functions. Visualization is important for enabling analysts to see
relationships in data. Below we list several visualization techniques that are relatively common
in modern toolkits and applications. We have divided these into four categories:
visualizations of graphs and networks, visualizations of hierarchical trees, visualizations of
arbitrary data that can be represented in tabular form, and geospatial and spatio-temporal
visualizations.

3.1.1.  Graph  Layout  Views.  The  circular  graph  layout  places  all  vertices  on  a  circle, 
and  draws  edges  as  lines  between  the  vertices  across  the  inside  of  the  circle.  These  layouts 
usually  involve  a  sorting  heuristic  to  group  closely  related  vertices  near  each  other. 

A radial graph layout requires that a “root” node be chosen, and then vertices are laid
out on concentric circles based on their distance from the root.
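
The radial layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: distance is taken to be the hop count found by breadth-first search, vertices are spread evenly around each circle, and the names radial_layout and ring_spacing are hypothetical rather than taken from any of the toolkits.

```python
import math
from collections import deque

def radial_layout(adjacency, root, ring_spacing=1.0):
    """Place each vertex on a circle whose radius is its hop distance from root."""
    # Breadth-first search gives the hop distance of every reachable vertex.
    distance = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in distance:
                distance[w] = distance[v] + 1
                queue.append(w)

    # Group vertices by ring, then spread each ring evenly around its circle.
    rings = {}
    for v, d in distance.items():
        rings.setdefault(d, []).append(v)

    positions = {}
    for d, members in rings.items():
        for i, v in enumerate(sorted(members)):
            angle = 2.0 * math.pi * i / len(members)
            positions[v] = (d * ring_spacing * math.cos(angle),
                            d * ring_spacing * math.sin(angle))
    return positions

# Hypothetical four-vertex graph given as an adjacency list.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}
layout = radial_layout(graph, root="a")
```

Real toolkits typically also order the vertices within each ring to reduce edge crossings, which this sketch does not attempt.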

More  common  for  complex  graphs  with  many  possible  cycles  is  the  family  of  force  di¬ 
rected  layouts.  These  involve  running  simple  spring  force  simulations  on  edges  as  well  as 
repulsion  between  vertices.  Normally  done  in  2D,  these  can  also  be  performed  in  3D,  al¬ 
though  that  can  result  in  vertices  and  edges  being  occluded  [6]. 

Hierarchical  [7]  graph  layouts  arrange  directed  graphs  on  a  2D  plane  by  using  directed 
relationships  and  minimizing  edge  crossovers  to  choose  node  position.  This  can  be  used 
with  any  directed  graph,  and  algorithms  typically  eliminate  cycles  in  graphs  by  removing  less 
important  edges  identified  by  some  heuristic  during  layout  calculations. 

A  view  suited  to  smaller  graphs,  the  adjacency  matrix  view  displays  a  graphical  repre¬ 
sentation  of  an  adjacency  matrix.  The  matrix  is  shown  as  a  grid,  with  vertex  labels  along  the 
top  and  side  edges.  For  every  edge,  the  cell  corresponding  to  the  two  vertices  connected  is 
filled  in  with  a  color.  Different  colors  can  be  used  to  represent  different  edge  properties. 
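
As a rough illustration of the data behind such a view, the following Python sketch builds the grid for an undirected graph. The function name and the use of a string edge property in place of a color are assumptions made for the example.

```python
def adjacency_matrix(vertices, edges):
    """Return a square grid; cell [i][j] holds the edge's property or None."""
    index = {v: i for i, v in enumerate(vertices)}
    grid = [[None] * len(vertices) for _ in vertices]
    for source, target, prop in edges:
        i, j = index[source], index[target]
        grid[i][j] = prop
        grid[j][i] = prop  # undirected graph: mirror across the diagonal
    return grid

# Hypothetical data: each edge carries a property that a view could map to a color.
vertices = ["a", "b", "c"]
edges = [("a", "b", "strong"), ("b", "c", "weak")]
matrix = adjacency_matrix(vertices, edges)
```

A view would then draw one cell per grid entry and fill the cells that are not None.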

3.1.2. Tree Layout Views. Tree maps nest each level of the tree within the rectangular
region of its parent. The size of the region depends on some vertex attribute.

Fig. 3.2. Comparison of Visualization Functionality (Trees)

A dendrogram places the root node of a tree on one edge (top, bottom, left or right) and
aligns the vertices of each level of the tree together. Unlike the hierarchical graph view, this
is a much simpler algorithm and only works with trees.

Similar  to  the  radial  graph  view,  a  radial  tree  view  places  the  tree  root  at  the  center,  and 
aligns  child  vertices  in  concentric  circles  around  it.  However,  unlike  in  the  general  graph 
case,  there  are  no  edge  crossovers  because  of  the  tree  structure,  resulting  in  a  cleaner  display. 

Similar to the radial tree view are the balloon tree view and the related cosmic tree view.
In these views, child vertices are arranged in a circle around their parent vertices.

Icicle  trees  display  trees  in  a  descending  vertical  structure,  which  results  in  an  appearance 
much  like  hanging  icicles.  The  root  of  the  tree  is  at  the  top  level  and  the  children  are  drawn 
to  fit  within  the  width  of  the  parent. 

3.1.3.  Tabular  Data  Views.  Scatter  plots  are  plots  of  two-  or  three-dimensional  data, 
which  is  drawn  by  representing  each  data  point  as  a  point  in  space.  Two-dimensional  data  is 
usually  shown  on  a  plane,  whereas  three-dimensional  data  is  shown  as  a  point  cloud. 

For statistical information, box and whisker plots (or simply box plots) show more
information than standard bar or line charts. They display the “five-number summary” of a
population: the sample minimum, lower quartile, median, upper quartile and sample maximum.
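
The five numbers behind a box plot can be computed with the Python standard library, as in the minimal sketch below. Quartile conventions differ between packages; the "inclusive" method used here is one choice among several and is an assumption of this example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # quantiles(..., n=4) returns the three cut points between the quarters.
    lower, median, upper = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), lower, median, upper, max(values))

summary = five_number_summary([1, 2, 3, 4, 5])
```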

Parallel  coordinates  are  visualizations  of  multidimensional  data,  which  represent  the 
axis  for  each  dimension  as  a  vertical  line  and  high-dimensional  points  as  polylines  connecting 
points  on  each  axis. 
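
A minimal Python sketch of the geometry follows. It assumes that each axis is scaled independently to the range [0, 1] and that the axes are placed at integer x positions; both are choices made for this example, and the function name is hypothetical.

```python
def parallel_coordinates(points):
    """Turn each n-dimensional point into a polyline of (axis_x, height) pairs."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    polylines = []
    for p in points:
        line = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # Normalize to [0, 1]; a constant dimension is drawn at mid-height.
            height = (p[d] - lows[d]) / span if span else 0.5
            line.append((d, height))
        polylines.append(line)
    return polylines

# Two hypothetical three-dimensional points; the third dimension is constant.
lines = parallel_coordinates([(1, 10, 5), (3, 20, 5)])
```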

Some visualization views are common in spreadsheet software programs. A bar or
column chart is a chart with rectangular bars. The lengths of the bars are proportional to
some values. A line chart is a 2D graph where points from the same data set are connected
by a line. A pie chart is a circular chart containing colored sectors where the arc lengths of
the sectors correspond to the proportion of the quantity to the whole. Area charts are similar
to line charts, except that the area below the line is filled in.

Fig. 3.3. Comparison of Visualization Functionality (Tabular Data)

There are also variations on the previously mentioned charts. One is the stacked bar
chart, which is a bar chart where the lengths of the bars are the sum of the data values, and
each value is represented as a portion of the total bar. There are also stacked line charts and
stacked area charts, where the total vertical height at any given horizontal point is the sum of
all data values at that point.

3.1.4.  Geospatial  and  Spatio-Temporal  Visualizations.  Map  overlays  are  symbols  or 
polygons  drawn  on  top  of  a  map,  either  a  two-dimensional  projection  or  a  three-dimensional 
globe.  Graduated  symbol  maps  are  similar  but  the  symbols  drawn  vary  in  size  based  on  some 
data  set,  for  example  relative  populations.  Choropleth  maps  use  color  coding  to  represent 
aspects  of  regions,  such  as  population,  crime  rate  or  average  incomes. 

There  are  many  different  map  projections  available  in  modern  software.  Dymaxion  maps 
are  global  maps  projected  onto  a  polyhedron  and  unfolded  into  a  two-dimensional  image. 
Cartograms  are  two-dimensional  projections  where  the  area  of  a  region  is  controlled  by  a 
data  value.  A  three-dimensional  globe  is  simply  a  textured  sphere  that  the  user  can  view  from 
any  angle. 

3.2.  Analysis  Capabilities.  Analytics  are  also  important  in  modern  toolkits.  An  analyst 
can  calculate  and  observe  properties  within  the  data.  We  have  divided  analysis  capabilities 
into  two  main  categories:  analysis  on  graphs  and  networks,  and  statistical  analysis  on  arbi¬ 
trary  data. 

3.2.1.  Graph  Analysis.  Adjacency  refers  to  being  able  to  identify  adjacent  neighbor 
vertices.  Accessibility  is  determining  if  one  vertex  can  be  reached  from  another.  Centrality 
means  calculating  measures  of  relative  importance  of  vertices  in  a  graph.  Common  connection 
is  the  process  of  identifying  the  set  of  vertices  that  can  be  reached  by  all  of  the  vertices  in  a 
given  set.  Connectivity  means  identifying  connected  components  in  a  graph. 


Fig. 3.4. Comparison of Visualization Functionality (Geospatial/Spatio-temporal Data)

Fig. 3.5. Comparison of Analysis Functionality (Graph)

Finding a minimum spanning tree for a graph means identifying a tree that is a subgraph of
the given graph, spans all of its vertices, and has the minimum total edge cost.
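
One well-known way to find such a tree is Kruskal's algorithm, sketched below in Python. The text does not say which algorithm any of the toolkits uses, so this is an illustration of the concept only, and the function name is hypothetical.

```python
def minimum_spanning_tree(vertices, weighted_edges):
    """Kruskal's algorithm: returns the chosen edges as (weight, u, v)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the set representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(weighted_edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate trees, so keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

# Hypothetical weighted graph on four vertices.
edges = [(4, "a", "b"), (1, "b", "c"), (3, "a", "c"), (2, "c", "d")]
tree = minimum_spanning_tree("abcd", edges)
```

On a disconnected graph the same loop yields one tree per connected component, that is, a minimum spanning forest.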

Finally, connected component analysis refers to labeling subsets of connected
components based on a heuristic.

3.2.2. Statistical Analysis. Univariate analysis refers to statistical analysis of
single-dimensional data, such as finding the mean and variance, calculating histograms, etc.
Bivariate analysis means statistical analysis of two-dimensional data, such as correlation.
Multivariate analysis is any sort of analysis of data in three or more dimensions.

Fig. 3.6. Comparison of Analysis Functionality (Other)

Columns: Univariate, Bivariate, Multivariate, Dimensionality Reduction, Clustering,
Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA).
Marked in all seven columns: Titan.
Marked in six columns: R.
Marked in one column: Gephi, JUNG, NetworkX, Flare, Tulip.
No marks: Axiis, birdeye, ESTAT, GeoVista Studio, Google Vis. API, GraphViz, Improvise,
Infovis Toolkit, JIT, JFreeChart, JGraphX, Mondrian, Prefuse, Protovis, VisAD, WilmaScope, Zest.

Special purpose techniques include the following: dimensionality reduction, which
reduces the number of dimensions in multivariate data by removing less significant
dimensions; clustering, which groups data points into clusters based on some relatedness
criterion; latent semantic analysis (LSA), which uses linear algebra techniques to identify
concepts in documents; and latent Dirichlet allocation (LDA), which uses statistical
techniques to identify concepts in documents.

3.3.  Development  Environment.  Platform  is  an  important  aspect  of  development.  A 
toolkit  can  be  designed  to  run  on  Windows,  Mac  OS  X,  Unix-like  systems  or  even  a  vir¬ 
tual  platform  like  Java  or  .NET.  These  days  the  World  Wide  Web  is  becoming  more  popu¬ 
lar  for  deploying  visualization  applications,  and  platforms  to  accomplish  this  include  basic 
JavaScript/HTML  and  Adobe  Flex. 

The  language  that  a  toolkit  is  written  in  can  be  useful  to  know  too;  most  often  the  appli¬ 
cation  developer  must  know  the  implementation  language  to  use  the  toolkit.  These  languages 
are  usually  limited  to  C,  C++  or  Java  for  desktop  applications,  and  JavaScript  or  ActionScript 
for  the  web.  Scripting  can  be  a  powerful  method  for  prototyping  applications  or  writing  sim¬ 
ple  one-off  programs  and  these  languages  can  be  easier  for  non-professional  programmers 
to  work  with  [25].  Therefore  scripting  languages  that  are  directly  supported  by  the  toolkits, 
such  as  Python  and  TCL  are  also  listed  here. 

GUI  frameworks  are  the  front-end  interfaces  available  for  each  toolkit.  These  include 
frameworks  such  as  Qt  in  C++,  Swing  and  SWT  in  Java,  and  web  GUI  interfaces  such  as 
HTML  and  Adobe  Flash. 

SQL  database  support  comes  in  two  flavors.  The  most  useful  is  direct  toolkit  support  for 
loading  data  from  databases.  This  provides  a  quick  solution  to  loading  data  without  needing 
to  know  much  about  how  databases  work.  The  other  is  lower  level  framework  support  for 
databases.  In  this  case,  the  application  developer  must  write  the  interface  to  convert  between 
data  formats  themselves.  While  this  is  still  a  useful  feature,  it  can  lead  to  longer  development 
times  than  direct  integration.

Fig. 3.7. Development Environment Comparison

File format support can be useful for exchanging data with other users or applications,
especially  when  dealing  with  graphs.  There  are  several  formats  that  support  graphs,  such  as 
DOT  [12],  GraphML  [3],  GML  [14]  and  VTK’s  graph  file  format  [21]. 

The  documentation  field  lists  the  available  documentation,  such  as  tutorials,  user’s  guides, 
reference  manuals  and  online  API  documentation. 

4. Benchmark Implementation. Stage two consisted of designing and implementing a
simple benchmark application that exercises both visualization and analysis capabilities.
Based on the results of stage one, we chose the three toolkits that appeared to have the most
visualization and analytic functionality: Prefuse Flare, Tulip and the Titan Informatics
Toolkit.

4.1.  Benchmark  Application.  The  benchmark  application  had  three  basic  tasks  to  per¬ 
form. 

First,  it  needed  to  be  able  to  load  a  graph  of  documents  and  links  between  similar  doc¬ 
uments  based  on  LSA  results.  In  addition,  it  needed  to  be  able  to  load  the  contents  of  each 
document.  The  data  is  from  another  project,  and  therefore  the  LSA  is  not  part  of  this  bench¬ 
mark  application. 

The second task was to display the graph, color-coding the vertices based on each
document’s concept label. The actual layout used was not important; however, the vertices and
the relationships between them needed to be apparent, so a force-directed layout was chosen
for all three implementations.

The  application  also  needed  to  be  able  to  show  the  document  contents  when  the  vertices 
were  selected  in  the  graph  view.  This  was  to  show  the  toolkit’s  ability  to  interface  with  GUI 
frameworks  and  load  extra  data  alongside  the  graph  definition. 

To demonstrate the analytical capability, we looked for functionality shared between all
three packages. We settled on using adjacency and connectivity functions, as described in
section 3.2.1. The benchmark application then needed to use the adjacency functionality to
find all adjacent vertices and mark their edges with a different color when the concept labels
did not match. The connectivity functionality was used to label every vertex with the
connected component it belongs to, so that the components could be compared with the concept
labeling results.


Fig. 4.1. Benchmark application implementation in Flare


4.2.  Flare.  Prefuse  Flare  is  a  visualization  and  analysis  toolkit  written  in  ActionScript 
for  the  Adobe  Flex  3.0  platform  [5].  It  is  the  spiritual  successor  to  the  Java-based  Prefuse, 
and  contains  much  of  the  same  functionality  [13].  There  has  not  been  an  official  release 
since  January  24,  2009,  but  this  toolkit  still  had  a  larger  number  of  visualization  and  analytics 
functions  than  most  of  the  others. 

The  tutorial  document  contained  not  only  an  introduction  to  Flare’s  structure,  but  also  a 
getting  started  guide  on  writing  Adobe  Flash/Flex  applications.  The  API  reference  is  mostly 
complete;  almost  all  classes  and  methods  are  fully  documented. 

4.2.1.  Importing  Data.  Originally  we  were  going  to  use  the  delimited  text  parser  in 
Flare,  but  it  was  unclear  how  to  convert  the  resulting  general  data  sets  into  a  graph  object. 
The  examples  provided  all  used  hardcoded  data  in  the  source,  so  we  were  unable  to  use  this 
converter. 

Our  second  choice  was  to  use  the  built-in  JSON  parser  to  convert  the  data  into  Flare’s 
internal  format.  Unfortunately,  the  exact  structure  of  the  JSON  objects  required  is  undocu¬ 
mented. 

Instead  we  used  the  GraphML  converter,  which  handled  every  aspect  of  our  data.  We 
were  able  to  import  document  nodes,  including  the  contents  of  the  original  documents,  as 
well  as  the  edges  between  them.  However,  this  method  required  us  to  convert  our  data  from 
the  delimited  text  files  to  the  GraphML  format. 

4.2.2. Visualization. There were several demos and sample applications showing how
to create a force-directed layout, so we implemented the similarity graph with no issues.
Several classes, such as the ColorEncoder, can set the color of a visualization item
based on a property, which made the coloring of the vertices and edges a simple task.

However, selecting vertices in this view was slightly awkward. A “hit area” needed to be
defined for each vertex, and there was no default value for this. Once the vertex
selection was in place, we added a simple Flash text view and set its contents to the
document contents of the selected vertices.

4.2.3. Analytics. Finding adjacent vertices using the list of edges was very
straightforward. We iterated through all of the edges in the graph, examining the vertices at
either end. The next step was simply applying a ColorEncoder with a filter, and all of the
desired edges were highlighted.

Finding  the  connected  components  was  slightly  more  complex.  Flare  does  not  have  a 
built-in  component  identification  filter,  but  we  were  able  to  use  a  simple  iterative  approach 
using  spanning  trees  to  label  connected  components  within  the  graph. 
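
The text does not give the details of this approach. One plausible reading, sketched below in Python rather than ActionScript, grows a breadth-first spanning tree from each vertex that has not yet been labeled and assigns every vertex reached the same component ID. The function name and the data are hypothetical.

```python
from collections import deque

def label_components(vertices, edges):
    """Label each vertex with the ID of the connected component it belongs to."""
    neighbors = {v: [] for v in vertices}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    component = {}
    next_id = 0
    for start in vertices:
        if start in component:
            continue  # already reached from an earlier spanning tree
        # Grow a spanning tree from this unlabeled vertex; every vertex the
        # tree reaches belongs to the same component.
        component[start] = next_id
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbors[v]:
                if w not in component:
                    component[w] = next_id
                    queue.append(w)
        next_id += 1
    return component

labels = label_components(["a", "b", "c", "d"], [("a", "b"), ("c", "d")])
```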


Fig.  4.2.  Benchmark  application  implementation  in  Tulip 


4.3.  Tulip.  The  implementation  of  the  application  in  Tulip  was  the  most  challenging  of 
the  three.  The  Tulip  project  consists  of  two  parts,  one  being  the  framework  application  which 
a  user  can  write  plug-ins  for,  and  the  second  being  a  C++  library  that  provides  external  access 
to  all  of  the  functionality  [30]. 

However,  despite  the  existence  of  the  library,  the  documentation  was  focused  almost 
entirely  on  the  application.  There  was  an  API  reference,  but  many  of  the  classes  and  methods 
were  undocumented.  Most  of  the  examples  provided  were  for  writing  plugins.  However,  by 
taking  a  look  at  the  source  code  of  the  main  application,  it  was  possible  to  find  the  information 
we  needed  to  write  the  benchmark  application. 

4.3.1. Importing Data. Tulip does not support loading arbitrary data into its system
through delimited text files, so we were forced to convert our graph to GML. However, with
the GML importer we were able to import the entire graph plus the document contents.


4.3.2. Visualization. Laying out the graph in a view initially appeared to be a simple
task. There was a force-directed layout plugin that looked like it would fit our needs perfectly.
Several parameters of the function used to invoke this plugin appeared to be optional, since
they had default values and no documentation stated otherwise, but for this plugin they
were in fact required. Only careful examination of the source for the main application
showed the necessary arguments to get the layout working.

Attempting  to  color  the  vertices  presented  a  problem.  The  entire  application  would  crash 
due  to  what  appeared  to  be  a  bad  pointer  when  loading  the  plugin  to  color  vertices  based  on  a 
category  parameter.  We  were  forced  to  implement  our  own  method  for  coloring  the  vertices 
and  edges. 

We used an interactor plugin to implement vertex selection. However, there was no
documentation on how to use this plugin in a view. After looking through much of the source
code, we finally found a function that would register an interactor plugin with the view, and
then mouse selection was enabled.

Tulip uses Qt as its GUI library, and there were Qt signals [26] defined for selection
events. However, these signals were never actually emitted anywhere in the source code, so
instead we had to create an observer for the “viewSelection” property of the graph object.

4.3.3.  Analytics.  Finding  adjacent  vertices  in  Tulip  was  the  same  process  as  in  Flare. 
We  simply  iterated  through  every  edge,  and  compared  the  document  groups  of  the  vertices  at 
either  end.  By  setting  the  property  we  used  in  our  custom  coloring  implementation,  we  were 
able  to  highlight  the  edges  easily. 

Identifying  connected  components  was  very  easy.  Tulip  provides  a  plugin  to  do  exactly 
this,  and  it  was  a  one-step  process  to  make  it  work.  The  plugin  set  a  property  to  a  component 
ID,  and  we  used  that  with  our  coloring  to  highlight  the  vertices  in  each  component. 


Fig.  4.3.  Benchmark  application  implementation  in  Titan 


4.4.  Titan.  Titan  has  a  significantly  different  architecture  than  the  other  toolkits.  Instead 
of  calling  methods  on  data  as  you  need  to,  it  uses  the  same  pipeline  architecture  as  VTK 
[29,  28].  The  developer  uses  a  chain  of  filters  to  process  the  data  at  each  step  in  the  pipeline. 
For  example,  the  output  of  a  table  reader  is  connected  to  the  input  of  a  filter  that  converts 
tables  to  graphs,  and  the  graph  output  of  that  filter  is  connected  to  the  input  for  a  graph  layout 
view. 
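
The chain of stages in this example can be mimicked in plain Python to show the idea. The sketch below is a simplification under stated assumptions: it does not use the VTK or Titan API, every function name in it is hypothetical, and it simply passes each stage's output to the next stage, whereas the real pipeline connects output ports to input ports and executes on demand.

```python
import csv
import io

def table_reader(text):
    """Source stage: parse delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def table_to_graph(rows, source_column, target_column):
    """Filter stage: turn each table row into an edge between two vertices."""
    edges = [(row[source_column], row[target_column]) for row in rows]
    vertices = sorted({v for edge in edges for v in edge})
    return {"vertices": vertices, "edges": edges}

def graph_summary_view(graph):
    """Sink stage: stands in for a layout view by describing the graph."""
    return f"{len(graph['vertices'])} vertices, {len(graph['edges'])} edges"

def run_pipeline(source, *stages):
    """Feed the output of each stage into the input of the next one."""
    data = source
    for stage in stages:
        data = stage(data)
    return data

text = "source,target\ndoc1,doc2\ndoc2,doc3\n"
result = run_pipeline(
    text,
    table_reader,
    lambda rows: table_to_graph(rows, "source", "target"),
    graph_summary_view,
)
```

The point of the design is that each stage only knows the type of data it consumes and produces, so stages can be rearranged or replaced without touching the others.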

Since Titan sits on top of VTK, a lot of its functionality actually comes from the VTK
library. Because of this, the documentation for these functions is very mature. However,
Titan is a relatively new project, so the documentation for the new functionality that is
specific to Titan is not complete, and there are not as many examples to draw from.

4.4.1.  Importing  Data.  We  were  able  to  import  our  tabular  data  immediately  using  a 
few  table  readers.  We  connected  the  output  of  these  to  the  input  ports  of  a  table  to  graph 
conversion  filter.  The  output  of  this  we  then  connected  to  other  filters  and  then  to  a  graph 
layout  view  to  display  the  graph. 

4.4.2. Visualization. The graph layout view worked immediately after setting a few
properties. We needed to specify which property the vertices should be colored by and to enable vertex coloring.
However, the color scheme was a linear mapping, so document groups that were adjacent
in number were also very similar in color and difficult to distinguish. To correct this, we used
a programmable filter to implement a simple shuffling mapping to split up these groups. With
this the colors appeared more random, and we achieved the results we were looking for.
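
The paper does not say which shuffling mapping was used. One simple possibility, shown here only as an illustrative sketch, is to multiply each group id by a stride that is coprime with the number of groups. This permutes the ids so that neighboring groups land far apart on a linear color scale. The function names and the choice of stride are assumptions.

```cpp
#include <cassert>

// Greatest common divisor (Euclid's algorithm).
int gcdOf(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Maps group id g in [0, n) to a shuffled slot in [0, n).
// The stride starts near n/2 and is increased until it is coprime with n,
// which makes g -> (g * stride) % n a permutation of the ids.
int shuffledSlot(int g, int n) {
    assert(n > 0 && g >= 0 && g < n);
    int stride = n / 2 + 1;
    while (gcdOf(stride, n) != 1) {
        ++stride;
    }
    return static_cast<int>((static_cast<long long>(g) * stride) % n);
}
```

With ten groups the stride becomes 7, so groups 0, 1, 2 map to slots 0, 7, 4 rather than to three neighboring colors.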

Vertex  selection  was  a  bit  more  complex.  There  are  mechanisms  for  communicating 
selections  between  views,  so  selecting  an  item  in  a  table  view  can  be  easily  reflected  in  a 
graph  view,  and  so  on.  However,  we  wanted  to  show  associated  text  from  a  document  in  a 
simple  Qt  text  browser  view,  and  there  was  no  built-in  mechanism  for  doing  that.  The  solution 
was  a  class  that  adapts  selection  events  to  Qt  signals,  which  we  were  able  to  use  to  get  the 
vertex  selection  and  update  the  text  browser  view  accordingly. 

4.4.3.  Analytics.  To  find  the  mismatched  documents,  we  again  created  a  programmable 
filter  that  iterated  through  all  edges  and  compared  concept  groups  of  the  vertices  at  each  end. 
By  setting  a  property  used  for  edge  coloring,  we  were  able  to  highlight  the  mismatched  edges. 
To  find  the  connected  components,  we  simply  connected  a  filter  that  identifies  components 
and  labels  them,  and  used  the  property  with  the  label  to  highlight  the  components  in  the  graph 
view. 

The  pipeline  model  seems  awkward  to  use  in  interactive  environments.  We  implemented 
the  analysis  functionality  by  splicing  two  filters  into  the  pipeline.  There  were  no  alternative 
methods  apparent  in  the  documentation. 

5. Conclusion. Every toolkit has its strengths and its shortcomings. Some toolkits, such
as Protovis and birdeye, clearly offer powerful visualization features. Others, like R and NetworkX,
offer strong analysis functionality. Many of these toolkits focus on one particular area,
such as graphs/networks, statistics, or geospatial analysis. Some are targeted at a particular
platform, such as a web browser.

There is no clear winner among the three toolkits we reviewed in depth. Flare offers a powerful,
interactive experience on a stable and portable web platform. Tulip has a powerful desktop
application, but the lack of documentation for its library components prevents it from being
a true contender. Titan brings both visualization and analysis to many platforms, covering a
broad range of functionality, but its massive size may be intimidating or overwhelming to
new users.
OPTIKA: A GUI FRAMEWORK FOR PARAMETERIZED APPLICATIONS

KURTIS L. NUSBAUM* AND MICHAEL A. HEROUX†

Abstract. In the field of scientific computing there are many specialized programs designed for specific applications
in areas like biology, chemistry, and physics. These applications are often very powerful and extraordinarily
useful in their respective domains. However, many suffer from a common problem: a non-intuitive, poorly designed
user interface. Many of these programs are homegrown, and the concern of the designer was not ease of use but
rather functionality. The purpose of Optika is to address this problem and provide a simple, viable solution. Using
only a list of parameters passed to it, Optika can dynamically generate a GUI. This allows the user to specify parameter
values in a fashion that is much more intuitive than the traditional “input decks” used by many parameterized
scientific applications. By leveraging the power of Optika, these scientific applications will become more accessible
and thus allow their designers to reach a much wider audience while requiring minimal extra development effort.


1. Introduction. In the world of scientific computing there is a problem: most software
developers are far more concerned with the functionality of their software than with its
user interface. This is understandable given the limited time and pressures of scientific computing
environments. In cases where a piece of software has only a few users,
this type of development is tolerable. However, when a piece of software starts to be used
by a wider audience, poor user interface design comes to the forefront and can greatly
hinder further adoption of that software. Optika¹ is an attempt to solve this
problem in a generic fashion for parameterized scientific applications.

Since developers of scientific applications typically give user interfaces low priority, Optika
needs to present as few hurdles as possible for developers. Also, the end result needs
to be an intuitive user interface that can be easily navigated and utilized regardless of the
underlying computations being done.

The  purpose  of  this  paper  is  to  discuss  the  development  of  the  Optika  package.  In  doing 
so  we  hope  to  demonstrate  how  Optika  solved  some  of  the  issues  associated  with  developing 
a  generic  user  interface  for  scientific  applications  and  provide  justification  for  why  we  chose 
particular  solutions.  We  will  proceed  to  discuss  Optika  development  in  a  semi-chronological 
fashion. 

2. Initial Planning and Development. In the fall of 2008, Dr. Mike Heroux identified a
need for the Trilinos Framework² to include some sort of GUI package. Dr. Heroux wanted
to  give  users  of  the  framework  the  ability  to  easily  generate  GUIs  for  their  programs,  while 
still  providing  a  good  experience  for  the  end-user.  Based  on  previous  GUI  work  we’d  done 
for  the  Tramonto  project  [6],  a  few  initial  problems  were  identified: 

•  How  would  the  GUI  be  laid  out? 

• What GUI framework would be used to build the GUI?

•  Different  types  of  parameters  require  different  methods  of  input.  How  would  the 
program  decide  how  to  obtain  input  for  a  particular  parameter? 

•  How  would  the  application  developer  specify  parameters  for  the  GUI  to  obtain? 

• How would the application developer specify dependencies between parameters?
This was a crucial problem/needed feature that was identified in previous development
of an unsuccessful Tramonto GUI.

After  some  deliberation,  the  following  initial  solutions  were  decided  upon: 
Fig. 2.1. The hierarchical layout of the GUI


• The GUI would be laid out in a hierarchical fashion as shown in Figure 2.1. Parameters
would be organized into lists and sublists. This would allow for a clear
organization of the parameters as well as intrinsically demonstrate the relationships
between them.

• Qt³ was chosen as the GUI framework for several reasons:

-  It  is  cross-platform. 

-  It  is  mature  and  has  a  comprehensive  set  of  development  tools. 

- It has a rich feature set.

-  It  has  been  used  by  Sandia  in  the  past. 

-  The  Optika  lead  developer  was  familiar  with  it. 

•  It  would  be  required  that  all  parameters  specify  their  type  and  the  following  types 
would  be  accepted: 


-  int 

-  short 

-  float 

-  double 


-  string 

-  boolean 

- arrays of int, short, double, and string


The supported data types were chosen for two main reasons: (a) The number of input
widgets  that  need  to  be  supported  is  a  direct  function  of  the  supported  data  types. 
By  only  supporting  a  small  set  of  basic  data  types,  we  could  stick  to  supporting  only 
a  small  number  of  input  widgets,  most  of  which  were  already  pre-built  and  part  of 
Qt.  (b)  The  development  team  felt  that  these  data  types  would  be  adequate  for  95% 
of  the  developers  who  would  be  using  Optika. 

For  number  types,  a  spin  box  (figure  2.2(a))  would  be  used  as  input.  If  the  valid 
values  for  a  string  type  were  specified,  a  combo  box  (figure  2.2(b))  would  be  used. 
Otherwise  a  line  edit  (figure  2.2(c))  would  be  used.  For  booleans,  a  combo  box 
(figure 2.2(b)) would also be used. For arrays, a pop-up box containing numerous
input  widgets  would  be  used.  The  widget  type  would  be  determined  by  the  array 
type  (e.g.  for  numerical  types  a  series  of  spinboxes  would  be  used). 

•  Initially  it  was  decided  that  the  application  developer  would  specify  parameters  via 
an  XML  file.  A  DTD  would  be  created  specifying  the  legal  tags  and  namespaces. 


³ For documentation on all Qt classes please visit [2].

(a) A Spin Box   (b) A Combo Box   (c) A Line Edit

Fig.  2.2.  Some  of  the  various  widgets  used  for  editing  data  [4] 


•  Dependencies  would  be  handled  through  special  tags  in  the  DTD. 
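
The widget-selection rule in the list above can be sketched as a single function. This is only an illustration of the planned mapping. The enum names and the function chooseWidget are hypothetical and do not come from Optika or Qt.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical enums; the text lists the types and widgets but not these names.
enum class ParamType { Int, Short, Float, Double, String, Bool, Array };
enum class Widget { SpinBox, ComboBox, LineEdit, ArrayPopup };

// Chooses an input widget from the parameter type. validStrings is the
// (possibly empty) list of allowed values for a string parameter.
Widget chooseWidget(ParamType type,
                    const std::vector<std::string>& validStrings =
                        std::vector<std::string>()) {
    switch (type) {
        case ParamType::Int:
        case ParamType::Short:
        case ParamType::Float:
        case ParamType::Double:
            return Widget::SpinBox;      // all number types
        case ParamType::String:
            // A restricted set of values becomes a combo box.
            return validStrings.empty() ? Widget::LineEdit : Widget::ComboBox;
        case ParamType::Bool:
            return Widget::ComboBox;     // true/false choice
        case ParamType::Array:
            return Widget::ArrayPopup;   // pop-up holding one widget per entry
    }
    return Widget::LineEdit;  // not reached for valid enum values
}
```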

3.  Early  Development.  The  first  several  months  of  development  were  spent  on  creating 
and  implementing  the  XML  specification.  The  name  of  the  XML  specification  went  through 
several  revisions  but  was  eventually  called  Dependent  Parameter  Markup  Language  (DPML). 

After  several  months  of  development  we  realized  that  creating  an  entirely  new  way  of 
specifying  parameters  might  hinder  adoption  of  Optika.  We  also  realized  that  Trilinos  actually 
had  a  ParameterList  class  in  the  Teuchos  [5]  package.  The  ParameterList  seemed  to  be  better 
than  DPML  for  several  reasons: 

•  It  was  already  heavily  adopted. 

• It had the necessary hierarchical nature (as described in figure 2.1).

•  It  was  serializable  to  and  from  XML. 

For these reasons, DPML was scrapped in favor of using Teuchos’s ParameterLists. Development
moved forward with the goal of creating a GUI framework that, in addition to
meeting all the challenges outlined above, would also be compatible with any existing program
using Teuchos’s ParameterLists.

4. Heavy development. Starting in May 2009, a heavier focus was put on development
of the Trilinos GUI package. With the back-end data structure of the Teuchos ParameterList
already in place, attention was turned to developing the actual GUI portions
of the framework. A key technology provided by Qt was its Model/View framework [1].
Using the Model/View paradigm, a wrapper class named TreeModel was created around the
ParameterList class by subclassing QAbstractItemModel.

However, in subclassing QAbstractItemModel it was realized that the ParameterList
class fell short in terms of providing certain features. The main issue was that a given ParameterEntry
located within a ParameterList, or a given sublist located within a ParameterList,
was not aware of its parent. This was an issue because Qt’s Model/View framework requires
items within a model to be aware of their parents. In order to circumvent this issue
the TreeItem class was created. Now the TreeModel class became more than just a simple
wrapper class. A TreeModel was created by giving it a ParameterList. It would then read
in the ParameterList and create a structure of TreeItems. Each TreeItem then contained a
pointer to its corresponding ParameterEntry. This allowed parent-child relationship data to
be maintained while still using ParameterLists as the true back-end data structure.
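
The idea of mirroring a parent-unaware list with parent-aware items can be sketched without Qt or Teuchos. In the sketch below, ParamNode is a hypothetical stand-in for a nested ParameterList, and the TreeItem shown is an illustration of the concept rather than Optika's actual class.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a nested parameter list: children only, no parent links.
struct ParamNode {
    std::string name;
    std::vector<ParamNode> children;  // empty for a plain parameter
};

// Tree item that mirrors one ParamNode and remembers its parent.
struct TreeItem {
    const ParamNode* entry;  // non-owning pointer into the backing list
    TreeItem* parent;        // nullptr for the root
    std::vector<std::unique_ptr<TreeItem>> children;

    TreeItem(const ParamNode* e, TreeItem* p) : entry(e), parent(p) {}

    // Position of this item within its parent, as a view would ask for.
    int row() const {
        if (!parent) return 0;
        for (std::size_t i = 0; i < parent->children.size(); ++i) {
            if (parent->children[i].get() == this) return static_cast<int>(i);
        }
        return -1;
    }
};

// Walks the nested list once and builds the parent-aware mirror.
std::unique_ptr<TreeItem> buildTree(const ParamNode& node, TreeItem* parent = nullptr) {
    std::unique_ptr<TreeItem> item(new TreeItem(&node, parent));
    for (const ParamNode& child : node.children) {
        item->children.push_back(buildTree(child, item.get()));
    }
    return item;
}
```

Because each item only points at its entry, edits made through the tree still land in the original list, which stays the single source of truth.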

Once the TreeModel and TreeItem classes were complete, an appropriate delegate to go
between a view and the TreeModel was needed. A new class simply called Delegate was
created to fill this role by subclassing QItemDelegate. As specified above, the delegate would
return the appropriate editing widget based on the data type carried within a given TreeItem.

With  the  model  and  delegate  classes  in  place,  an  appropriate  view  could  be  applied.  At 
first  a  simple  QTreeView  was  applied  to  the  model.  Later,  as  additional  functionality  was 
added  the  view  class  needed  to  perform  more  functions.  To  fill  these  needs,  the  QTreeView 
class was subclassed, creating the TreeView class. Its main duties were to show and hide
parameters  as  needed  and  handle  any  bad  parameter  values  that  might  come  up  during  the 
course  of  the  GUI  execution.  These  features  were  needed  due  to  requirements  that  arose  from 
dependencies  (something  that  will  be  discussed  later). 
Finally, the OptikaGUI class was created. It had one static function, getInput. A ParameterList
is passed to this function, a GUI is generated, and all end-user input is stored in the
ParameterList that was passed to the function. When the end-user hits the submit button, the
GUI closes and the ParameterList that was passed to the getInput function now contains all
of the end-user input. The end result was something like that in figure 4.1.


Parameter Input

Parameter                             Value         Type
Enable Delayed Solver Construction    false         bool
Linear Solver Type                    Amesos        string
► Linear Solver Types                               list
Matrix File                           memplus.mtx   string
Preconditioner Type                   Ifpack        string
► Preconditioner Types                              list
Tolerance                             1e-05         double

[ Submit ]


Fig.  4.1.  The  end  result  of  the  initial  Optika  development 


5. Advanced Features. With the basic framework in place, we were now able to move
on to more advanced features. As these advanced features were developed, various refactorings
were made to the already existing code in order to support these new features.

5.1. Validators. One of the goals of Optika is to make life easier for the end-user. It’s
not enough to simply give the end-user information; it must be conveyed in a meaningful way.
Validators are a great way of informing an end-user what the valid set of values for a particular
parameter is. Teuchos ParameterLists already came with built-in validator functionality, but
the default validators that were available were sorely lacking in capability. Three initial sets of
validators were created to help deal with the shortcomings of the available validator classes:

EnhancedNumberValidators allowed for validating various number types. EnhancedNumberValidators
have the following abilities:

•  Set  min  and  max. 

•  Set  the  step  with  which  the  number  value  is  incremented. 

•  Set  the  precision  with  which  the  number  value  is  displayed. 

StringValidators allowed for a parameter to be designated as only accepting values of type
string and allowed for specifying a valid list of values.

ArrayValidators allowed for all validator types to be applied to an array of values. The
validator that is applied to each entry in the array is called the prototype validator.

A fourth validator type, the FileNameValidator, was added later. This validator designates
a  particular  string  parameter  as  containing  a  file  path  and  allows  the  developer  to  indicate 
whether  or  not  the  file  must  already  exist.  Since  filenames  are  such  an  important  type  of 
string,  it  made  sense  that  they  would  have  their  own  validator. 

By interpreting these validators, Optika could either put certain restrictions on the input
widget for a parameter or entirely change the type of input widget used. For instance,
with EnhancedNumberValidators the min, max, step, and precision of the EnhancedNumberValidator
are all used to directly set their corresponding values in the QSpinBox class. But with
the FileNameValidator a QFileDialog would appear instead of the normal QComboBox or
QTextEdit used for string validators.
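
How a number validator's aspects might carry over to a spin box can be sketched as follows. Both structs are hypothetical stand-ins. NumberValidator is not Optika's EnhancedNumberValidator, and SpinBoxSettings is not a Qt class, although its field names are modeled on the minimum, maximum, singleStep, and decimals properties of Qt's QDoubleSpinBox.

```cpp
#include <cassert>

// Hypothetical number validator carrying the four aspects named in the text.
struct NumberValidator {
    double min;
    double max;
    double step;
    int precision;

    // A value is valid when it lies inside the closed range [min, max].
    bool isValid(double v) const { return v >= min && v <= max; }
};

// Minimal stand-in for the configurable state of a spin box.
struct SpinBoxSettings {
    double minimum;
    double maximum;
    double singleStep;
    int decimals;
};

// Copies each validator aspect onto the corresponding widget setting,
// so the widget itself prevents out-of-range input.
SpinBoxSettings configureSpinBox(const NumberValidator& v) {
    SpinBoxSettings s;
    s.minimum = v.min;
    s.maximum = v.max;
    s.singleStep = v.step;
    s.decimals = v.precision;
    return s;
}
```

Pushing the limits into the widget means invalid values usually cannot be entered in the first place, while isValid remains available for values that arrive by other routes, such as a loaded file.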

5.2. Dependencies. Many times the state of one parameter depends on the state of another.
Common inter-parameter dependencies and their requirements include:

Visual Dependencies: One parameter may become meaningless when another parameter
takes on a particular value. In this case the end-user no longer needs to be aware
of the meaningless parameter, and it’s best to remove it from their view entirely
so they don’t become confused. Visual dependencies should allow the
developer to express that “if parameter x takes on a particular value, then don’t display
parameter y to the end-user anymore.”

Validator Dependencies: Sometimes the valid set of values for one parameter changes if
another parameter takes on a particular value. Validator Dependencies should allow
the developer to express that “if parameter x takes on a particular value, change the
validator on parameter y.”

Validator Aspect Dependencies: Sometimes the developer doesn’t want to change the validator
on a particular parameter, but rather just a certain aspect of it. Validator Aspect
Dependencies should allow the developer to express that “if parameter x takes on a
particular value, change this aspect of the validator on parameter y based on the new
value of parameter x.”

Array Length Dependencies: Sometimes the length of an array in a parameter changes
based on the value of another parameter. Array Length Dependencies should allow
the developer to express that “if parameter x changes its value, change the length of
the array in parameter y based on the new value of parameter x.”

Coming up with a way for the developer to easily express these concepts was not a simple
task. The first problem that had to be solved was how to keep track of all the dependencies.
They couldn’t just be stored in a ParameterList as a class member because of the recursive
structure of ParameterLists. Eventually, it was decided that a new data structure called a
DependencySheet would hold all the dependencies used for a certain ParameterList. Each
Dependency would at minimum specify the dependent parameter and the dependee parameter.
However, a complication arose. Because we wanted dependencies to be able to have
arbitrary dependents and dependees, we needed a way to uniquely identify the dependee and
the dependent. As the ParameterEntry class stood, there was no way of doing this, and we
didn’t want to add this capability to the ParameterEntry class. The Teuchos package is fundamental
to the Trilinos Project, and before we started changing it for our purposes we wanted
Optika to have a solid footing and to be absolutely sure that any changes made to Teuchos were
actually necessary. We needed to find another way to uniquely identify parameters within a
ParameterList.

We decided to use the name of the parameter as the identifier because the accessor functions
for a ParameterList usually use the parameter’s name to identify a particular parameter.
While names of parameters are unique within a ParameterList, names are not necessarily
unique across a set of sublists. Therefore, in order to uniquely identify a parameter and allow
dependencies across sublists, Optika would need to know both the parameter name and the
parent list containing it⁴.

⁴ The astute reader will notice that if there are two sublists with different parent lists and each sublist has a
parameter with the same name, then Optika will not be able to uniquely identify the dependent and the dependee.
Since solving this problem would most likely require a lot of refactoring of code not directly part of the Optika
package, we decided to address it at a later date.

So it became that every dependency, along with needing the names of the dependee
and dependent, also needed their respective parent lists. The DependencySheet also needed
the root list which contained all of the dependees and dependents. This was so Optika could
recursively search for the parameters and their parent sublists (the only way to find them using
our method of identification). The following Dependency classes were created to address the
use cases above (shown as a hierarchy of classes):

Dependency: Parent class for all Dependencies.
    NumberArrayLengthDependency: Changes an array’s length.
    NumberValidatorAspectDependency<T>: Changes various aspects of an EnhancedNumberValidator.
    ValidatorDependency: Changes the validator used for a particular parameter.
        BoolValidatorDependency: Changes the validator used for a particular parameter
        based on a boolean value.
        RangeValidatorDependency<T>: Changes the validator used for a particular
        parameter based on a number value.
        StringValidatorDependency: Changes the validator used for a particular parameter
        based on a string value.
    VisualDependency: Shows or hides a particular parameter.
        BoolVisualDependency: Shows or hides a particular parameter based on a
        boolean value.
        NumberVisualDependency<T>: Shows or hides a particular parameter based
        on a supported number type value.
        StringVisualDependency: Shows or hides a particular parameter based on a
        string value.

Some of these dependencies have fairly novel and robust capabilities. Namely, the NumberArrayLengthDependency,
NumberValidatorAspectDependency, and NumberVisualDependency
classes can all take a pointer to a function as an argument. In the case of the NumberArrayLengthDependency,
this function can be applied to the value of the dependee parameter.
The return value of this function is then used as the length of the array for the dependent parameter.
For NumberValidatorAspectDependencies, the function is applied to the dependee
value and used to calculate the value of the chosen validator aspect. In the NumberVisualDependency
class, if the function when applied to the dependee value returns a value greater than
0 the dependent is displayed. Otherwise, the dependent is hidden.
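
The function-pointer mechanism can be illustrated with two small stand-in structs. These are simplified, hypothetical versions of the classes named above, not their real Optika interfaces. The dependee value is passed in directly instead of being looked up in a ParameterList.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical array-length dependency: the dependee is an int parameter and
// the dependent is an array whose length follows from it.
struct ArrayLengthDependency {
    int (*lengthOf)(int);  // optional; nullptr means "use the dependee value as-is"

    void evaluate(int dependeeValue, std::vector<double>& dependent) const {
        int n = lengthOf ? lengthOf(dependeeValue) : dependeeValue;
        if (n < 0) n = 0;  // a negative result is treated as an empty array
        dependent.resize(static_cast<std::size_t>(n));
    }
};

// Hypothetical number visual dependency: show the dependent when f(value) > 0.
struct NumberVisualDependency {
    double (*f)(double);

    bool dependentVisible(double dependeeValue) const {
        return f(dependeeValue) > 0;
    }
};

// Example functions a developer might supply.
int oneMoreThan(int n) { return n + 1; }
double minusTen(double x) { return x - 10.0; }
```

Here oneMoreThan would keep an array one entry longer than the dependee value, and minusTen would reveal the dependent only once the dependee exceeds 10.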

The  algorithm  for  expressing  dependencies  in  the  GUI  is  as  follows: 

1.  A  parameter’s  value  is  changed  by  the  end-user. 

2. The TreeModel queries the associated dependency sheet to see whether or not the
parameter that changed has any dependents.

3. If the parameter does have dependents, the TreeModel requests a list of all the dependencies
in which the changed parameter is a dependee.

4. For each dependency, the evaluate function is called. The dependency makes any
necessary changes to the dependent parameter and the TreeModel updates the TreeView
with the new data.

5.  If  any  dependents  now  have  invalid  values,  focus  is  given  to  them  and  the  end-user 
is  requested  to  change  their  value  to  something  more  appropriate. 

The  order  in  which  dependencies  are  evaluated  is  arbitrary.  We  have  not  tested  what 
happens  under  the  conditions  of  conflicting  dependencies  or  circular  dependencies  yet.  This 
is  an  area  for  further  study. 
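
The five steps above can be sketched as a small, self-contained loop. Everything here is a simplification made for illustration. Parameters are a flat map of integers identified by name only (Optika also needs the parent list), the GUI update is omitted, and the Dependency and DependencySheet types are hypothetical stand-ins for the real classes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical flat parameter store: name -> integer value.
using Params = std::map<std::string, int>;

// A dependency names its dependee and dependent. evaluate may modify the
// dependent and returns false if the dependent is left with an invalid value.
struct Dependency {
    std::string dependee;
    std::string dependent;
    std::function<bool(Params&)> evaluate;
};

// Holds every dependency for one parameter set, indexed by dependee name.
class DependencySheet {
public:
    void add(const Dependency& d) { byDependee_[d.dependee].push_back(d); }

    bool hasDependents(const std::string& name) const {
        return byDependee_.count(name) != 0;
    }

    const std::vector<Dependency>& dependenciesOf(const std::string& name) const {
        return byDependee_.at(name);
    }

private:
    std::map<std::string, std::vector<Dependency>> byDependee_;
};

// Applies a change and returns the dependents that now need the end-user's attention.
std::vector<std::string> onValueChanged(Params& params, const DependencySheet& sheet,
                                        const std::string& name, int newValue) {
    params[name] = newValue;                                  // step 1
    std::vector<std::string> invalid;
    if (!sheet.hasDependents(name)) return invalid;           // step 2
    for (const Dependency& d : sheet.dependenciesOf(name)) {  // steps 3 and 4
        if (!d.evaluate(params)) {
            invalid.push_back(d.dependent);                   // step 5
        }
    }
    return invalid;
}
```

Like the algorithm it mirrors, this sketch evaluates only direct dependents and does nothing about conflicting or circular dependencies.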

5.3. Custom Functions. Normally, in Optika the end-user configures the ParameterList,
hits submit, the GUI disappears, and the program continues with execution. However, an alternative
to this workflow was desired: a persistent GUI. The development team
added the ability to specify a pointer to a function that would be executed whenever the end-user
hit submit. The function was required to have the signature foo(Teuchos::RCP<const
ParameterList> userParameters).

5.4. Various Niceties. Various niceties were added to the GUI as well. The ability to
save and load ParameterLists was added. The OptikaGUI class was expanded to allow for
customization of the window icon and use of Qt Style Sheets to style the GUI. Checks were
also added to see if the end-user was trying to exit the GUI without saving. In such a case
they would be warned and given the option to save their work. Tooltips were added so that whenever
the end-user hovered over a parameter, its documentation string would be displayed.
Also, the ability to search for a parameter was added.

6. Waiting For Copyright. All of the above features were completed around or shortly
after the end of August 2009 and Optika was officially given its name. Optika was then
submitted for copyright. It took Optika a little over six months to complete copyright. Since
it was not yet copyrighted, it could not be included in the Trilinos 10 release in October 2009.
During the time Optika spent in copyright limbo, little development on Optika was done.
Most of the work was cleaning up various pieces of code, adding examples, and adding
documentation. Finally, in March 2010 Optika completed copyright and was ready to be
included in Trilinos. It was released to the public with the Trilinos 10.2 release.

7. User Feedback. In the summer of 2010, Optika got its first user. Dr. Laurie Frink
began using Optika to create a GUI for Tramonto. There had been a previous attempt to build
a GUI for Tramonto, but it had been largely unsuccessful.

Initial feedback was very positive. Dr. Frink was very impressed with the capabilities of
Optika and the ease with which she could construct a GUI. However, she did have some small
initial issues picking up Optika, most of which arose from the fact that she is a C programmer
and Optika is C++ based. Her issues were always easily and quickly addressed. Some of her
more involved questions even led to the creation of some great examples.

For  the  most  part  Dr.  Frink  found  Optika  to  be  quite  adequate  for  her  purposes.  However, 
she  did  have  one  rather  major  feature  request:  she  needed  the  ability  to  specify  multiple 
dependents  and  in  some  cases  even  multiple  dependees.  This  was  quite  a  task  and  required  a 
large  reworking  of  the  Dependency  part  of  the  framework. 

Adding  support  for  multiple  dependents  was  fairly  trivial.  Instead  of  specifying  a  single 
dependent  to  the  constructor  of  a  Dependency,  a  list  of  Parameters  was  now  passed.  If  the 
developer  only  needed  one  dependent  then  he/she  could  just  pass  a  list  of  length  one.  This 
simple  list  worked  in  the  case  of  all  the  dependents  having  the  same  parent  list.  If  they  had 
different  parent  lists,  then  a  more  complex  data  structure  which  mapped  parameters  to  parent 
lists  would  be  used.  Convenience  constructors  were  also  made  for  simple  cases  where  just 
one  dependent  was  needed.  The  algorithm  used  for  evaluating  dependencies  changed  very 
little  with  these  modifications.  The  only  addition  needed  was  an  extra  loop  for  evaluating 
each  dependent  in  a  dependency  for  a  given  dependee. 

Adding support for multiple dependees was much harder. There was actually only one
specific use case where multiple dependees were needed or even appropriate for that matter.
Dr. Frink needed the ability to test the condition of multiple parameters to determine whether
or not a particular parameter should be displayed. So a new VisualDependency class called
ConditionVisualDependency was created. ConditionVisualDependencies evaluate a condition
object to determine whether or not a set of dependents should be hidden or shown.
The set of condition classes created is as follows (shown as a hierarchy of classes):

Condition: Parent class of all conditions.
    ParameterCondition: examines the value of a particular parameter and evaluates
    to true or false accordingly. Types of ParameterConditions include:
        BoolCondition: examines boolean parameters.
        NumberCondition<T>: examines number parameters.
        StringCondition: examines string parameters.
    BinaryLogicalCondition: examines the value of two or more conditions passed to
    it and evaluates to true or false accordingly. Types of BinaryLogicalConditions
    include:
        AndCondition: returns the equivalent of performing a logical AND on all
        conditions passed to it.
        EqualsCondition: returns the equivalent of performing a logical EQUALS on
        all conditions passed to it.
        OrCondition: returns the equivalent of performing a logical OR on all conditions
        passed to it.
    NotCondition: examines the value of one condition passed to it and evaluates to
    the opposite of whatever that condition evaluates to.

Through the recursive use of BinaryLogicalConditions, the developer can now chain together
an arbitrary number of dependents.
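The recursive composition of conditions can be illustrated with a minimal, self-contained C++ sketch. The class names mirror the hierarchy listed above, but everything else is an assumption made for illustration: the `sketch` namespace, the `evaluate()` method, the constructors, and the use of `std::shared_ptr` are hypothetical and are not Optika's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

namespace sketch {

// Parent class of all conditions (hypothetical interface).
struct Condition {
  virtual ~Condition() {}
  virtual bool evaluate() const = 0;
};

typedef std::shared_ptr<const Condition> ConditionPtr;

// Stand-in for a ParameterCondition: watches a boolean owned elsewhere.
struct BoolCondition : Condition {
  explicit BoolCondition(const bool* value) : value_(value) {}
  bool evaluate() const override { return *value_; }
  const bool* value_;
};

// Logical AND over two or more conditions.
struct AndCondition : Condition {
  explicit AndCondition(const std::vector<ConditionPtr>& c) : conds_(c) {
    assert(c.size() >= 2);
  }
  bool evaluate() const override {
    for (std::size_t i = 0; i < conds_.size(); ++i)
      if (!conds_[i]->evaluate()) return false;
    return true;
  }
  std::vector<ConditionPtr> conds_;
};

// Logical OR over two or more conditions.
struct OrCondition : Condition {
  explicit OrCondition(const std::vector<ConditionPtr>& c) : conds_(c) {
    assert(c.size() >= 2);
  }
  bool evaluate() const override {
    for (std::size_t i = 0; i < conds_.size(); ++i)
      if (conds_[i]->evaluate()) return true;
    return false;
  }
  std::vector<ConditionPtr> conds_;
};

// Negation of exactly one condition.
struct NotCondition : Condition {
  explicit NotCondition(const ConditionPtr& c) : cond_(c) {}
  bool evaluate() const override { return !cond_->evaluate(); }
  ConditionPtr cond_;
};

}  // namespace sketch
```

Because the logical conditions accept any `Condition`, including other logical conditions, an expression such as (a OR b) AND NOT (a AND b) is built by nesting objects rather than by adding new classes.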

ConditionVisualDependencies are the only dependencies which allow for multiple dependents.
So while support was added for multiple dependents at the Dependency parent class level,
ConditionVisualDependency is the only class which actually implements the functionality.
In the case of multiple dependents, the algorithm for evaluating dependencies did not
need to change at all.

At the time of this publication, the Optika team is still waiting to hear back from Dr.
Frink as to whether or not these new features meet her needs.

8.  Future  Development.  There  are  several  main  development  goals  for  Optika  in  the 
near  future. 

• We would like to give users the option to write an Optika GUI (with dependencies
and validators) entirely in XML. Our hope is that, since writing XML is simpler than
writing C++, this will further ease the demands on the application developer. To
achieve this goal, XML serialization for all of the validator and dependency related
classes must be developed. We started working on serialization this summer. XML
serialization for validators is almost finished, after which serialization for the
dependency and dependency sheet classes will begin.

•  We  would  like  to  develop  a  stand-alone  version  of  Optika.  The  development  team 
believes  that  the  potential  audience  for  Optika  is  much  larger  than  just  the  user  base 
of  Trilinos.  However,  creating  a  stand-alone  version  presents  the  problem  of  keeping 
source  code  consistent  between  the  Optika  that  exists  in  Trilinos  and  the  stand-alone 
version.  This  is  an  issue  that  we  will  need  to  make  sure  to  address. 

• We would like to create a single Optika-based executable that acts as a generic
ParameterList configurator. It would take in a ParameterList in XML format, allow the
user to configure the ParameterList, and then either output the entire ParameterList
again with the new settings or output a ParameterList containing only the parameters
that were changed. We think this will be useful because it will enable end-users to
utilize Optika without requiring the program they are using to implement Optika
support (just ParameterList support).

• A simpler and unambiguous way to uniquely identify dependees and dependents is
needed. The current system is clunky and does not guarantee unique identification.

• Do a more thorough use case analysis. Initially, only a few people were consulted on
the potential requirements of such a generic GUI. A broader selection of potential
users would help us determine how we can make Optika more useful in a much
broader context.

•  We  need  to  further  investigate  what  happens  when  dependencies  conflict  or  become 
circular.  Right  now  the  behavior  of  Optika  under  such  conditions  is  unknown. 

9. Conclusions. It is difficult to tell yet whether Optika has met its initial goals. As of
the writing of this paper, Optika has only one user. Hopefully, through further development
and evangelization, its user base can grow. This will then allow us to see if we truly
are meeting the needs of the scientific community. Based on early use of Optika by Dr. Frink,
we believe that Optika is indeed robust enough to meet most of the community’s needs, but
we cannot say for sure until we have more user testing.
Appendix

A.  Nomenclature. 

Dependency  A relationship between two or more parameters in which the state or value of
one set of parameters depends on the state or value of another.

Dependee  The parameter upon which another parameter’s state or value depends.

Dependent  A parameter whose state or value is determined by another parameter.

Parameter  An input needed for a program.

ParameterList  A class containing a list of parameters and other parameter lists.

ParameterEntry  A class containing a parameter located in a ParameterList.

RCP  Reference counted pointer. RCPs referred to in this document reference the RCP class
located in the Teuchos [5] package of Trilinos.

Sublist  A parameter list contained within another parameter list.

Widget  A GUI element, usually used to obtain user input.

Validator  An object used to ensure a particular parameter’s value is valid.
EPETRA/AZTECOO AND RELATED TO TPETRA/STRATIMIKOS AND
RELATED: A CONVERSION GUIDE

KELLY J. FERMOYLE AND MICHAEL A. HEROUX

Abstract. Tpetra gives users additional flexibility in the choice of data types and computing platforms, for
example allowing users to choose floats instead of doubles for the scalar type or to use a long integer for practically
limitless operator sizes. The flexibility in platforms allows for use with graphics processing units or even hybrid
systems that combine them with conventional processors. Tpetra was designed so that experienced Epetra users, and
even new users, can transition to it easily. We describe some design differences between Tpetra and Epetra, including
the typedef keyword and the use of Teuchos utilities. This guide gives examples with side-by-side comparisons of
common classes. The examples can be compiled to create a simple tri-diagonal solver in both Epetra and Tpetra. This
guide begins with the most essential classes and continues to classes that use these base classes. We then explain the
steps needed to wrap the objects in Thyra and create a Stratimikos solver.


1.  Introduction.  Templated  code  allows  programmers  much  more  flexibility  to  easily 
use  data  types  that  are  most  appropriate  for  their  code.  This  is  especially  important  in  the 
design  of  library  code  that  is  used  for  many  different  applications.  Templated  design  is  a 
relatively  new  paradigm,  and  Trilinos  is  one  of  the  first  scientific  libraries  to  support  templated 
programming  [4]. 

Tpetra allows for the use of new computing environments, including graphics processors
and the IBM Cell processor [5, 6]. These new platforms can deliver very high computing
rates, but many existing code bases cannot easily be ported to harness this power. Tpetra is
templated on the node type, which allows the programmer to specify, for example, that a
Thrust GPU node will be used. The code would be no different than if a serial CPU were used,
since the node type is set automatically.

The rest of this guide proceeds as follows. Section 2 explains some of the design decisions
in Tpetra as well as tips to manage code and understand the rest of the guide. Section 3
describes the differences between Epetra and Tpetra classes and includes examples of creating
them and using important methods. Section 4 describes the steps used to wrap objects in
Thyra and solve linear systems using Stratimikos. Finally, Section 5 concludes the guide.


Example 1 Header.hpp: The keyword ‘using’ allows us to avoid the namespace Teuchos
before RCP. We also typedef the scalar and ordinal types.


   #include "Teuchos_RCP.hpp"
 2 // To avoid providing the namespace Teuchos with RCP or rcp
   using Teuchos::RCP;
 4 using Teuchos::rcp;

 6 // Typedef here makes it clear what the scalar and ordinal are
   // All scalars and ordinals can be changed to e.g. float in one line
 8 typedef double Scalar;
   typedef int LO;
10 typedef int GO;


2. Background. This section combines some tips to make the code more manageable
with a description of some of the standardized features used in Tpetra. Typedefs are used
to keep the code looking neat. Reference counted pointers are used to ease memory management
and to avoid memory leaks through automatic garbage collection. Finally, we discuss the design
decisions that led to using Teuchos array objects instead of pointers to scalars.


Example 2 TpetraTest.cpp: The driver program for each of the examples in this guide. This
shows the order in which the guide proceeds. MPI_Init is only needed for Epetra. We use RCP for
Tpetra objects, which uses automatic garbage collection, so they do not need to be deleted.


#include "Header.hpp"
#include "CommTest.hpp"
#include "MapTest.hpp"
#include "VectorTest.hpp"
#include "MatrixTest.hpp"
#include "SetMat.hpp"
#include "ImportTest.hpp"
#include "ThyraTest.hpp"
#include "SolveTest.hpp"
#include "ArrayTest.hpp"

int main(int argc, char* argv[]){
#ifdef EPETRA_MPI
  MPI_Init(&argc, &argv);
#endif
  CreateComm(); CreateMap(); CreateVec(); CreateMat(); CreateImp();
  RandomizeVec(); UseImp(); SetMat(); WrapThyra(); Solve(); ArrayTest();
  delete epetra_comm; delete epetra_map; delete epetra_vector;
  delete epetra_lhs; delete epetra_multivector; delete epetra_matrix;
  delete epetra_import; delete epetra_export; delete epetra_solver;
#ifdef EPETRA_MPI
  MPI_Finalize();
#endif
}
void SimpleConstructor(){
  // This code shows the difference between using a pointer
  // and the simple objects.
  Epetra_Map *pointer = new Epetra_Map(NumGlobalElements, 0, *epetra_comm);
  Epetra_Map simple(NumGlobalElements, IndexBase, *epetra_comm);
  // The simple object does not use '*' and no call to new
  simple.MaxMyGID(); // Uses '.' instead of '->'
  delete pointer; // We do not delete simple because it is not a pointer
  RCP<MAP> tpetra_m = rcp(new MAP(NumGlobalElements, 0, tpetra_comm));
  // Tpetra constructors look more like Epetra with pointers
  // tpetra_m is deleted automatically when it goes out of scope
}


A few notes about the example program. We use header files to easily display each
portion of code in Examples 3-11 separately. Normally, a program would not be set up as
independent files, but the function calls remain the same. Because of this, we need to use
global pointers to the objects so that they are available to each function in the different
header files. Many users use the simple constructor rather than a call to new; Example 2
shows the use of the simple constructor on line 25. The use of pointers means that calls to
member functions use ->function() rather than .function().

2.1. Define types to clean up code using typedef. Tpetra introduces a number of template
parameters needed to define certain classes. Using all or most of them can quickly
make the code unmanageable and hard to read. The use of typedef can make the code easier
to write and also more readable to everyone else. This guide uses typedef for the scalar and
ordinal types throughout the code. Example 1 lists this simple use of typedef in lines 8-10.
In addition, each of the basic types uses typedef to hide the template parameters and make
the code more legible. An example of this can be found in Example 5 on line 4. This line uses
typedef to shorten the declaration of the Map type to simply MAP, and uses the passed template
parameters. Lines 7 and 13 show the use of this map typedef.
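The effect of these typedefs can be seen without Trilinos in the following minimal sketch. The class template `sketch::Map`, the placeholder `SerialNode`, and the helper `countElements` are hypothetical stand-ins written for this illustration; only the names LO, GO, and MAP follow the conventions of this guide.

```cpp
#include <cassert>

namespace sketch {

struct SerialNode {};  // hypothetical placeholder for a node type

// Stand-in for a class that takes several template parameters.
template <class LocalOrdinal, class GlobalOrdinal = LocalOrdinal,
          class Node = SerialNode>
class Map {
 public:
  explicit Map(GlobalOrdinal numGlobalElements)
      : numGlobalElements_(numGlobalElements) {}
  GlobalOrdinal getGlobalNumElements() const { return numGlobalElements_; }

 private:
  GlobalOrdinal numGlobalElements_;
};

}  // namespace sketch

// Changing an ordinal type for the whole program is a one-line edit here.
typedef int LO;
typedef long GO;

// The template parameters are spelled out once and hidden behind MAP.
typedef sketch::Map<LO, GO> MAP;

// Later code refers only to the short names.
inline GO countElements(const MAP& map) { return map.getGlobalNumElements(); }
```

If GO were later changed from long to int, neither the MAP typedef nor `countElements` would need to be edited.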


Example 3 ArrayTest.hpp: Tpetra uses Array, ArrayView, and ArrayRCP in place of raw
pointers.


void ArrayTest(){
  Teuchos::Array<Teuchos::Array<Scalar> > data(NumVectors);
  Teuchos::Array<Teuchos::ArrayView<Scalar> > dataView(NumVectors);
  // We need to allocate data with Teuchos::Array::resize()
  for(size_t i=0; i < NumVectors; i++){
    data[i].resize(NumGlobalElements);
    dataView[i] = data[i];
  }
  // But we need to pass the function Teuchos::ArrayView
  tpetra_multivector->get2dCopy(dataView);
  // View methods use Teuchos::ArrayRCP for a persisting view
  Teuchos::ArrayRCP<Teuchos::ArrayRCP<Scalar> > view
    = tpetra_multivector->get2dViewNonConst();
}


2.2. Teuchos features in Tpetra. Tpetra makes extensive use of reference counted
pointers (RCP) as defined in the Teuchos namespace to pass objects. Example 5, line 13
shows the use of RCP both for the constructor of the Tpetra::Map and as the parameter for
passing the Teuchos::Comm. In addition to passing objects for construction, RCPs are used
as the return type in many of the get methods for the communicator, map, or other objects.
For example, the Tpetra::Map function getComm() returns an RCP to the
communicator. Example 1 includes the necessary header file, and lines 3 and 4 tell the
compiler that we are using RCP and rcp so we do not need to write the Teuchos namespace.
Much more information on RCPs, including examples, can be found in the beginner’s guide [2].
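The ownership behavior of a reference counted pointer can be sketched with the C++ standard library. In the block below, `std::shared_ptr` is used as a stand-in for Teuchos::RCP, which is an assumption made so that the example compiles without Trilinos; the two are similar in spirit but are not the same class. The `Comm` and `Map` types and the function `ownersWhileMapAlive` are hypothetical.

```cpp
#include <cassert>
#include <memory>

namespace sketch {

// Hypothetical communicator that only records a rank.
struct Comm {
  int rank;
  explicit Comm(int r) : rank(r) {}
};

// An object that shares ownership of its communicator.
class Map {
 public:
  explicit Map(const std::shared_ptr<const Comm>& comm) : comm_(comm) {}

  // Get methods hand back a counted pointer rather than a raw pointer.
  std::shared_ptr<const Comm> getComm() const { return comm_; }

 private:
  std::shared_ptr<const Comm> comm_;
};

// Reports how many owners the communicator has while a Map is alive.
// The Map releases its share automatically when it goes out of scope.
inline long ownersWhileMapAlive(const std::shared_ptr<const Comm>& comm) {
  Map map(comm);
  return comm.use_count();
}

}  // namespace sketch
```

No delete appears anywhere: the communicator is destroyed when its last owner goes away, which is the property the guide relies on for Tpetra objects.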

Using C-style pointers can be tedious and lead to memory errors, segmentation faults, or
memory leaks. Tpetra was designed to avoid these problems as much as possible by using
advanced features in Teuchos that hide much of the tricky coding needed for passing pointers.
It uses Teuchos::ArrayView and Teuchos::ArrayRCP in much the same way that double* is
used in Epetra. For users familiar with the Standard Template Library in C++, these memory
management classes should be easy to work with.

Example 3 shows the use of Teuchos::ArrayView and Teuchos::ArrayRCP with the
Tpetra::MultiVector class. The code here is called after the multivectors are created, but is
shown now to demonstrate the use of ArrayView and ArrayRCP. This example first allocates space
and gets a copy of the data. This copy is not persistent, meaning that if the multivector
values change, the copy will not reflect these changes. Next, the example gets a view of the
data with get2dViewNonConst. This view is persistent, so changes will be reflected in the
ArrayRCP. For the copy, the example first creates two objects: the data variable is an array
of arrays, and the dataView variable is an array of array views. ArrayView is used as a
lightweight wrapper for the Array class. We create both objects because Tpetra needs an
ArrayView, but the space is allocated with Array. There is an implicit conversion from Array
to ArrayView, but the conversion is not applied to the inner arrays of a nested Array. The
loop is used to allocate space in data, and then dataView is used to point to the allocated
space. Finally, dataView is passed as the argument to the method.
Either variable can be used from this point on to access the data.

Next, ArrayTest shows the use of Teuchos::ArrayRCP. ArrayRCP is used for persisting
views of data in Tpetra classes. Space does not need to be allocated for this type because the
data it points to already exists.
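The difference between a copy and a persisting view can be reproduced with a small stand-alone sketch. The `View` struct and the `Vector` class below are hypothetical stand-ins for Teuchos::ArrayView and a single-column multivector; the method names `getCopy`, `getViewNonConst`, and `putScalar` are chosen for this illustration and are not claimed to match the Tpetra signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical lightweight, non-owning view of contiguous data.
template <class T>
struct View {
  T* ptr;
  std::size_t size;
  T& operator[](std::size_t i) const {
    assert(i < size);
    return ptr[i];
  }
};

// Hypothetical stand-in for a vector object that owns its values.
class Vector {
 public:
  explicit Vector(std::size_t n) : values_(n, 0.0) {}

  // Copies the current values into storage allocated by the caller.
  void getCopy(View<double> out) const {
    assert(out.size == values_.size());
    for (std::size_t i = 0; i < values_.size(); ++i) out[i] = values_[i];
  }

  // Returns a view that aliases the vector's own storage.
  View<double> getViewNonConst() {
    View<double> v = {values_.data(), values_.size()};
    return v;
  }

  // Sets every entry to the same value.
  void putScalar(double s) {
    for (std::size_t i = 0; i < values_.size(); ++i) values_[i] = s;
  }

 private:
  std::vector<double> values_;
};

}  // namespace sketch
```

After a call to `putScalar`, storage filled earlier through `getCopy` still holds the old values, while a view obtained from `getViewNonConst` shows the new ones. The caller allocates space only for the copy.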

2.3. LocalOrdinal vs. GlobalOrdinal vs. size_t. This section explains the difference
between the ordinal types used in Tpetra. LocalOrdinal is used when referring to data that is
on this processor. GlobalOrdinal is used for data that could be on any processor. For example,
the return type for Map::getGlobalElement is GlobalOrdinal because the element could
be on any processor. This function returns the global index of an element when given
its local index. By comparison, the return type of Map::getMaxLocalIndex is LocalOrdinal
because the largest index on this processor must be within the bounds of LocalOrdinal. The
largest local index also has a global equivalent, which is a GlobalOrdinal.

The cardinality of Tpetra objects is described by either global_size_t or size_t, depending
on whether the count is taken across all processors or just on the local processor. Again, the
difference between these types is whether or not the elements are all on-processor. There is
no type local_size_t; it is just called size_t. For example, in CrsMatrix, getGlobalNumRows
returns global_size_t because it counts rows that may be on any processor. The local
equivalent is getNodeNumRows, which returns size_t.

When  converting  code  from  Epetra  to  Tpetra,  it  is  often  helpful  to  use  the  context  of  how 
a  variable  is  used  to  determine  if  it  is  local  or  global.  It  might  not  be  immediately  obvious  by 
a  variable’s  name  or  simple  usage  if  it  should  be  global  or  local,  but  the  context  should  make 
it  apparent. 
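As a rough illustration of how local and global quantities relate, the sketch below distributes a range of global indices over several processes in contiguous blocks. The `ContiguousMap` struct and its `firstGlobal` helper are hypothetical; the method names `getNodeNumElements` and `getGlobalElement` are borrowed from the text, but the bodies are an assumed uniform distribution and are not Tpetra's implementation.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

typedef int LocalOrdinal;    // indexes entries owned by this process
typedef long GlobalOrdinal;  // indexes entries across all processes

// Hypothetical contiguous map: each process owns one consecutive block of
// global indices, and the first (numGlobal % numProcs) processes own one extra.
struct ContiguousMap {
  GlobalOrdinal numGlobal;
  int numProcs;
  int myRank;

  // A local count, so it is a size_t.
  std::size_t getNodeNumElements() const {
    const GlobalOrdinal base = numGlobal / numProcs;
    const GlobalOrdinal extra = numGlobal % numProcs;
    return static_cast<std::size_t>(base + (myRank < extra ? 1 : 0));
  }

  // First global index owned by this process.
  GlobalOrdinal firstGlobal() const {
    const GlobalOrdinal base = numGlobal / numProcs;
    const GlobalOrdinal extra = numGlobal % numProcs;
    return myRank * base + (myRank < extra ? myRank : extra);
  }

  // Takes a local index and returns the matching global index.
  GlobalOrdinal getGlobalElement(LocalOrdinal local) const {
    assert(local >= 0);
    assert(static_cast<std::size_t>(local) < getNodeNumElements());
    return firstGlobal() + local;
  }
};

}  // namespace sketch
```

With 10 global elements on 3 processes, rank 0 owns global indices 0 to 3, rank 1 owns 4 to 6, and rank 2 owns 7 to 9. A variable that is passed to `getGlobalElement` is therefore local, and a variable that receives its result is global.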

3. Epetra vs. Tpetra Basic Types. Here we discuss the important features that have
changed during the development of Tpetra. Many of the types in Tpetra are very similar to
those in Epetra, but users should be aware of what has changed. This guide proceeds from the
most basic type, communicators in 3.1, to the more complicated types of matrices and import
objects in 3.4 and 3.5, respectively. Example 2 shows the include statement for each of the
following examples as well as the order of calls to functions.

3.1. Communicators. The first critical difference between Epetra and Tpetra is the way
objects communicate. Epetra objects use an Epetra_Comm to learn their rank, the number
of processors, and so on. Tpetra, on the other hand, does not use an Epetra_Comm, but a
communicator from the Teuchos namespace. Teuchos::Comm is templated on an Ordinal, which in
most cases will be int. Overall, it is used like the Epetra version, mostly for the
construction of other objects.

Creating a Teuchos::Comm object is different from Epetra, where the communicator is simply
declared as either Serial or MPI. The preferred method is to use the platform in Tpetra to
create the communicator. A communicator could be instantiated using standard constructors
after determining manually if MPI is installed, but the developers recommend using the
platform, as in Example 4, to create the object.

Note that Example 4 creates the communicator in a reference counted pointer in line 24.
The Epetra version requires the user to determine if the program is built with MPI
using the compiler definitions as seen in lines 17-21; this is not needed in Tpetra.
Similarly, we no longer need to initialize MPI because this is taken care of in the background
when appropriate.

3.2. Maps. Creating maps in Tpetra is similar to Epetra. The primary differences are
that Tpetra uses RCP, as seen with the communicator, and that Tpetra::Map uses three template
Example 4 CommTest.hpp: Tpetra::DefaultPlatform determines if MPI is installed or not, so
we do not need to use compiler flags.


#include "Epetra_Comm.h"
#ifdef EPETRA_MPI
#include "mpi.h"
#include "Epetra_MpiComm.h"
#else
#include "Epetra_SerialComm.h"
#endif
#include "Teuchos_Comm.hpp"
#include "Tpetra_DefaultPlatform.hpp"

typedef Teuchos::Comm<int> COMM;

const Epetra_Comm* epetra_comm;
RCP<const COMM> tpetra_comm;

void CreateComm(){
#ifdef EPETRA_MPI
  epetra_comm = new Epetra_MpiComm(MPI_COMM_WORLD);
#else
  epetra_comm = new Epetra_SerialComm();
#endif

  // This determines if MPI is installed or not
  tpetra_comm = Tpetra::DefaultPlatform::getDefaultPlatform().getComm();
}


Example 5 MapTest.hpp: Use the typedef to simplify the constructor and it is easy to create
the map.


#include "Epetra_Map.h"
#include "Tpetra_Map.hpp"

typedef Tpetra::Map<LO, GO> MAP;

Epetra_Map* epetra_map;
RCP<MAP> tpetra_map;
GO NumGlobalElements = 10;
LO IndexBase = 0;

void CreateMap(){
  epetra_map = new Epetra_Map(NumGlobalElements, IndexBase, *epetra_comm);
  tpetra_map = rcp(new MAP(NumGlobalElements, IndexBase, tpetra_comm));
}


parameters. The first is the local ordinal, the integer type used for indices on the
local processor. The second is the global ordinal, which is similar except that it is used to
index elements on any processor. By default the global ordinal is set to the same type as the
local ordinal. Finally, Tpetra::Map uses a node parameter, which gives the ability to use
different computing platforms. In most cases, users will allow this parameter to take the
default value that is set by Kokkos::DefaultNode. The default value is determined based on
which platform is installed with Trilinos. Example 5 shows the construction of the maps with
integer ordinals and the default node platform chosen.

The design of Tpetra::Map differs from Epetra because the BlockMap class was not
carried over. Most of the methods of Tpetra::Map are used like they are in Epetra. The
Epetra function NumGlobalElements(), for example, is getGlobalNumElements() in Tpetra.
The noteworthy differences are that Tpetra uses keywords like get and set to specify whether
a method retrieves or sets parameters, and that, as a standard, the first letter of a method
name is not capitalized in Tpetra. Another difference is that the return types stem from the
template parameter types that were set on creation. So, the return type of global element
counts is global_size_t, and similarly the maximum global element is returned as type
GlobalOrdinal. This difference will often not be noticeable to users, but it does allow for
greater flexibility and makes it more obvious when a return type refers to the global or
local space.


Example 6 VectorTest.hpp: The constructor and methods of the vector type are similar to
Epetra.


#include "Epetra_Vector.h"
#include "Epetra_MultiVector.h"
#include "Tpetra_Vector.hpp"
#include "Tpetra_MultiVector.hpp"

typedef Tpetra::Vector<Scalar, LO, GO> VEC;
typedef Tpetra::MultiVector<Scalar, LO, GO> MV;
Epetra_Vector* epetra_vector;
Epetra_MultiVector* epetra_multivector;
RCP<VEC> tpetra_vector;
RCP<MV> tpetra_multivector;
Epetra_Vector* epetra_lhs;
RCP<VEC> tpetra_lhs;
size_t NumVectors = 1;

void CreateVec(){
  epetra_vector = new Epetra_Vector(*epetra_map);
  epetra_multivector = new Epetra_MultiVector(*epetra_map, NumVectors);
  epetra_lhs = new Epetra_Vector(*epetra_map);
  tpetra_vector = rcp(new VEC(tpetra_map));
  tpetra_multivector = rcp(new MV(tpetra_map, NumVectors));
  tpetra_lhs = rcp(new VEC(tpetra_map));
}

void RandomizeVec(){
  epetra_vector->Random();
  // simple_vector.Random();
  epetra_multivector->Random();
  tpetra_vector->randomize();
  tpetra_multivector->randomize();
}

3.3. Vectors and MultiVectors. Tpetra::Vector introduces one more template parameter
that needs to be set upon creation: a scalar type used for the storage of each value
within the vector or multivector. This allows the user to specify, for example, that float
be used instead of double. The float type may be very useful in GPU programming because
GPUs operate much faster on the 32-bit type than on the 64-bit type. Besides this
extra template parameter, creating a Tpetra::Vector or Tpetra::MultiVector follows the
pattern of the communicator and map. Example 6 creates the vector with zeroed-out values and
the map defined earlier. It also uses local and global ordinals of integer type and a double
precision scalar. The function RandomizeVec() shows a simple function call that sets
all values to random numbers. The only differences between the Epetra and Tpetra calls are
the case of the first letter and the form of the verb.

Another difference between Epetra and Tpetra is the use of array utilities in Teuchos
in place of pointers. Section 2.2 discusses this change in more detail. For the Vector and
MultiVector classes this means that instead of setting values using a double-pointer, we can
use a Teuchos::ArrayView of any templated type.
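The role of the scalar template parameter can be sketched with a small class template. The `sketch::Vector` class below and its methods `putScalar`, `dot`, and `storageBytes` are hypothetical and written only for this illustration; they show how one definition serves both float and double, and are not the Tpetra::Vector interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical vector whose storage type is chosen by the template argument.
template <class Scalar>
class Vector {
 public:
  // Values start out zeroed.
  explicit Vector(std::size_t n) : values_(n, Scalar(0)) {}

  // Sets every entry to s.
  void putScalar(Scalar s) {
    for (std::size_t i = 0; i < values_.size(); ++i) values_[i] = s;
  }

  // Dot product, accumulated in the same Scalar type.
  Scalar dot(const Vector& other) const {
    assert(values_.size() == other.values_.size());
    Scalar sum = Scalar(0);
    for (std::size_t i = 0; i < values_.size(); ++i)
      sum += values_[i] * other.values_[i];
    return sum;
  }

  // Number of bytes used to store the values.
  std::size_t storageBytes() const { return values_.size() * sizeof(Scalar); }

 private:
  std::vector<Scalar> values_;
};

}  // namespace sketch
```

Switching a program from `sketch::Vector<double>` to `sketch::Vector<float>` changes the storage size and the arithmetic precision but none of the calling code, which is the flexibility the scalar typedef in Example 1 is meant to expose.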


Example 7 MatrixTest.hpp: There is very little difference in creating a CrsMatrix in Epetra
and Tpetra.


#include "Epetra_CrsMatrix.h"
#include "Tpetra_CrsMatrix.hpp"

typedef Tpetra::CrsMatrix<Scalar, LO, GO> MAT;
Epetra_CrsMatrix* epetra_matrix;
RCP<MAT> tpetra_matrix;
size_t NumEntriesPerRow = 3;

void CreateMat(){
  epetra_matrix = new Epetra_CrsMatrix
    (Copy, *epetra_map, *epetra_map, NumEntriesPerRow);
  tpetra_matrix = rcp(new MAT(tpetra_map, tpetra_map, NumEntriesPerRow));
}


3.4. Matrices. Matrices in Tpetra are most commonly stored in a compressed row storage
format, which is implemented in Tpetra::CrsMatrix. Like the Epetra version, the Tpetra
matrix uses the pure virtual interface RowMatrix to specify the methods needed for the
matrix. As in Epetra, RowMatrix inherits from Operator, which allows it to be used
as the linear operator in linear systems. Example 7 creates the Epetra and Tpetra CrsMatrix
objects. The Epetra_DataAccess enumerated type was not carried over to the Tpetra version;
it is assumed that the data will be in copy mode. This design decision is the result of the
need for compatibility with CUDA GPU programming. GPU programming requires that the data
be copied over to the GPU memory, so a view of it is not possible in this case. Otherwise,
the matrices are created almost identically, with the row and column maps (RCP in Tpetra)
and the number of entries per row.

Example 8 fills the matrix in the method SetMat(). The main difference between Epetra
and Tpetra when filling a matrix is the use of Teuchos::ArrayView instead of the raw pointer.
The Teuchos::Array cols plays the role of the integer pointer Indices. Note that the Array
is not created with new and therefore does not need to be deleted. The other change is the
order of parameters to insert the matrix values: the indices now come before the values, and
the number of entries is not passed.

The SetMat() function also shows how the maps can be used to get the number of elements
and go from local to global indices. The shorthand Epetra_BlockMap function GID() has no
short equivalent in Tpetra, which uses getGlobalElement() instead. An example of this is
lines 10-11, which use equivalent commands for Tpetra and the commented version of Epetra.
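The row pattern that SetMat() assembles can be checked in isolation. The sketch below builds the column indices and values for one row of the tri-diagonal matrix with 4 on the diagonal and -1 on the off-diagonals, using `std::vector` as a stand-in for Teuchos::Array. The `Row` struct and the function `tridiagonalRow` are hypothetical helpers for this illustration, and the indices are global rather than local for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Column indices and values for one matrix row; each array knows its length.
struct Row {
  std::vector<long> cols;
  std::vector<double> vals;
};

// Builds row `globalRow` of the n-by-n matrix tridiag(-1, 4, -1).
// The first and last rows have two entries; all other rows have three.
inline Row tridiagonalRow(long globalRow, long n) {
  assert(n >= 1);
  assert(globalRow >= 0 && globalRow < n);
  Row row;
  if (globalRow > 0) {  // sub-diagonal entry
    row.cols.push_back(globalRow - 1);
    row.vals.push_back(-1.0);
  }
  row.cols.push_back(globalRow);  // diagonal entry
  row.vals.push_back(4.0);
  if (globalRow < n - 1) {  // super-diagonal entry
    row.cols.push_back(globalRow + 1);
    row.vals.push_back(-1.0);
  }
  return row;
}

}  // namespace sketch
```

Because each array carries its own length, a routine that receives `row.cols` and `row.vals` needs no separate entry count, which mirrors the reason the count argument is absent from the Tpetra insert call. Note also that the last row is the one whose global index is n - 1.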
Example 8 SetMat.hpp: This method shows how to fill the matrix and a few map methods.
Use the Teuchos::Array class instead of raw pointers and we no longer need to delete it.

void SetMat(){
  int* Indices = new int[NumEntriesPerRow];
  double* Values = new double[NumEntriesPerRow];

  Teuchos::Array<LO> cols(NumEntriesPerRow);
  Teuchos::Array<Scalar> vals(NumEntriesPerRow);
  for(LO i = 0; i < tpetra_map->getNodeNumElements(); i++){
  //for(int i = 0; i < epetra_map->NumMyElements(); i++){
    cols.clear(); cols.resize(3); vals.clear(); vals.resize(3);
    if(tpetra_map->getGlobalElement(i) == 0){
    //if(epetra_map->GID(i) == 0){
      // First row: only two entries
      Indices[0] = i; Indices[1] = i+1;
      Values[0] = 4; Values[1] = -1;
      epetra_matrix->InsertMyValues(i, NumEntriesPerRow-1, Values, Indices);
      cols.resize(2); vals.resize(2);
      cols[0] = i; cols[1] = i+1;
      vals[0] = 4; vals[1] = -1;
      tpetra_matrix->insertLocalValues(i, cols, vals);
    } else if(tpetra_map->getGlobalElement(i) == NumGlobalElements-1){
      // Last row: only two entries
      Indices[0] = i-1; Indices[1] = i;
      Values[0] = -1; Values[1] = 4;
      epetra_matrix->InsertMyValues(i, NumEntriesPerRow-1, Values, Indices);
      cols.resize(2); vals.resize(2);
      cols[0] = i-1; cols[1] = i;
      vals[0] = -1; vals[1] = 4;
      tpetra_matrix->insertLocalValues(i, cols, vals);
    } else {
      Indices[0] = i-1; Indices[1] = i; Indices[2] = i+1;
      Values[0] = -1; Values[1] = 4; Values[2] = -1;
      epetra_matrix->InsertMyValues(i, NumEntriesPerRow, Values, Indices);
      cols[0] = i-1; cols[1] = i; cols[2] = i+1;
      vals[0] = -1; vals[1] = 4; vals[2] = -1;
      tpetra_matrix->insertLocalValues(i, cols, vals);
    }
  }
  epetra_matrix->FillComplete(); tpetra_matrix->fillComplete();
  delete [] Indices; delete [] Values;
}


3.5.  Import  and  Export.  The  Import  and  Export  classes  are  used  to  move  data  between 
this  processor  and  other  processors.  Its  important  function  is  to  translate  between  the  source 
and  target  maps.  Importing  and  exporting  can  be  done  to  any  object  that  inherits  from  the 
DistObject  base  class.  Example  9  shows  the  creation  of  Import  and  Export  objects  in  both 
Epetra  and  Tpetra.  It  also  includes  a  function  that  shows  how  the  primary  function  is  called 
from  any  object  that  inherits  from  DistObject. 

We use the multivector to show that Import() in Epetra has become doImport() in Tpetra.
As the comment in the code notes, this call accomplishes nothing because the source and
target maps in the Import object are the same. The important thing to note is that the
enumerated type for CombineMode has changed. Previously the options were: Add, Zero, Insert,
InsertAdd, Average, and AbsMax. In Tpetra the only options are: ADD, INSERT, and REPLACE.
Additionally, the Tpetra namespace qualifier is required, as seen in the example. When
creating the Import or Export objects, the Tpetra version takes the parameters in the order
SourceMap, TargetMap, which is the opposite of the Epetra version.

Example 9 ImportTest.hpp: Creating import and export objects is almost identical in Tpetra
and Epetra.

#include "Epetra_Import.h"
#include "Epetra_Export.h"
#include "Tpetra_Import.hpp"
#include "Tpetra_Export.hpp"

typedef Tpetra::Import<LO, GO> IMP;
typedef Tpetra::Export<LO, GO> EXP;

Epetra_Import* epetra_import;
Epetra_Export* epetra_export;
RCP<IMP> tpetra_import;
RCP<EXP> tpetra_export;

void CreateImp() {
    epetra_import = new Epetra_Import(*epetra_map, *epetra_map);
    epetra_export = new Epetra_Export(*epetra_map, *epetra_map);
    tpetra_import = rcp(new IMP(tpetra_map, tpetra_map));
    tpetra_export = rcp(new EXP(tpetra_map, tpetra_map));
}

void UseImp() {
    // This code doesn't do anything except show the function call
    epetra_multivector->Import(*epetra_vector, *epetra_import, Insert);
    tpetra_multivector->doImport(*tpetra_vector, *tpetra_import, Tpetra::INSERT);
}

4.  Solving  linear  systems  with  Tpetra.  We  created  the  vectors  and  matrices  so  that 
we  can  solve  linear  systems.  Stratimikos  gives  the  ability  to  easily  customize  the  solver 
and  preconditioner  in  many  ways.  Stratimikos  is  a  new  package  to  unify  the  various  solvers 
and  preconditioners  within  the  Trilinos  framework.  It  does  all  the  creation  of  the  solver  and 
preconditioner  in  the  background  based  on  the  parameters  specified  in  a  simple  XML  file. 
This  section  describes  the  process  of  wrapping  objects  in  Thyra  and  the  steps  to  use  the 
Stratimikos  solver. 

4.1. Wrap the Tpetra objects in Thyra. The Stratimikos solver described in Section 4.2
allows the user to easily create a Thyra solver with specific parameters chosen. In
order to use this effectively, we need to wrap the Tpetra objects so Thyra can operate on
them as it would on any Operator or VectorSpace objects. Thyra recommends using the
create[Object] functions, as seen in Example 10, to simplify the calls. The file
Thyra_TpetraThyraWrappers.hpp provides these functions to easily wrap the Tpetra objects.
This code creates a Thyra::VectorSpace, Thyra::Vector, and Thyra::LinearOp to be used as
map, vector, and operator respectively. Each of these is necessary to solve the linear system
in Stratimikos. For more information about the Thyra package, see the guide about its
operators and vectors [1].

Example 10 ThyraTest.hpp: Wrapping the Tpetra objects in Thyra allows generic
algorithms to be used with the solver.

#include "Thyra_TpetraLinearOp.hpp"
#include "Thyra_TpetraVector.hpp"
#include "Thyra_VectorSpaceBase.hpp"
#include "Thyra_TpetraThyraWrappers.hpp"

typedef Kokkos::DefaultNode::DefaultNodeType Node;
typedef Thyra::VectorSpaceBase<Scalar> ThyraVS;
typedef Thyra::VectorBase<Scalar> ThyraVEC;
typedef Thyra::LinearOpBase<Scalar> ThyraOP;

RCP<const ThyraVS> thyra_vs;
RCP<ThyraVEC> thyra_rhs;
RCP<ThyraVEC> thyra_lhs;
RCP<ThyraOP> thyra_op;

void WrapThyra() {
    // Wrap the map as a vector space
    thyra_vs = Thyra::createVectorSpace<Scalar, LO, GO, Node>(tpetra_map);
    // Wrap the vectors for Thyra
    thyra_lhs = Thyra::createVector(tpetra_lhs, thyra_vs);
    thyra_rhs = Thyra::createVector(tpetra_vector, thyra_vs);
    // Wrap the matrix as a linear operator
    thyra_op = Thyra::createLinearOp<Scalar, LO, GO, Node>(tpetra_matrix, thyra_vs, thyra_vs);
}


4.2. The Stratimikos solver. The use of the AztecOO solver with the Epetra matrices
and vectors will look familiar to experienced Epetra users. We show the most basic call to
the AztecOO solver, which creates it with default parameters and no preconditioner. With no
options set, we have a basic solver that will only work with simple linear systems. A real code
would set options with SetAztecOption() to tailor the solver to more complicated systems.

The best way to create a Stratimikos solver is to use the default builder. This method takes
the solver parameters from an XML file, so the only argument that needs to be passed is the
name of the XML parameter file. Through this file we can select among a number of different
solvers and preconditioners, for example Belos, AztecOO, or Amesos as the solver and Ifpack
as the preconditioner. All the options for the XML file and its format can be found in the
Trilinos documentation [3].

Example 11 lists the steps used to get the solver from Stratimikos and call the solve()
method. The call to the DefaultLinearSolverBuilder constructor creates the Stratimikos
default solver builder from the parameters chosen in the XML file. The next two lines are
needed to force the solver builder to read the parameters that we passed to it. Then we get a
Thyra::LinearOpWithSolveFactory from the solver builder. This is next given the
Thyra::LinearOp that was wrapped in Example 10. Finally, a simple call to solve() passes the
right- and left-hand-side Thyra vectors. This returns a SolveStatus that can be used to check
for convergence and has a print method that can be used to easily verify the result of the
call to solve().

Example 11 SolveTest.hpp: The parameters are passed to the solver through the XML file;
after that, it is easy to call solve() on our solver.

#include "AztecOO.h"
#include "Stratimikos_DefaultLinearSolverBuilder.hpp"
#include "Thyra_LinearOpWithSolveFactoryBase.hpp"
#include "Thyra_LinearOpWithSolveBase.hpp"
#include "Thyra_LinearOpWithSolveFactoryHelpers.hpp"

using Stratimikos::DefaultLinearSolverBuilder;
using Teuchos::VerboseObjectBase;

RCP<DefaultLinearSolverBuilder> tpetra_solver;
RCP<Thyra::LinearOpWithSolveFactoryBase<Scalar> > lows_factory;
RCP<Thyra::LinearOpWithSolveBase<Scalar> > lows;
Thyra::SolveStatus<Scalar> status;

AztecOO* epetra_solver;
int MaxIters = 100;
double Tolerance = 1e-6;

void Solve() {
    epetra_solver = new AztecOO(epetra_matrix, epetra_lhs, epetra_vector);
    epetra_solver->Iterate(MaxIters, Tolerance);

    tpetra_solver = rcp(new DefaultLinearSolverBuilder("Parameters.xml"));
    RCP<Teuchos::FancyOStream> out = VerboseObjectBase::getDefaultOStream();
    tpetra_solver->readParameters(out.get());
    lows_factory = tpetra_solver->createLinearSolveStrategy("");
    lows = Thyra::linearOpWithSolve<Scalar>(*lows_factory, thyra_op);
    status = lows->solve(Thyra::NOTRANS, *thyra_rhs, thyra_lhs.ptr());
}

5. Conclusions. Tpetra is a powerful new package in Trilinos that gives users the ability
to use multi-precision algorithms and different computing platforms. Tpetra makes extensive
use of Teuchos utilities to help with memory management. RCP is used for reference-counted
garbage collection, and the Array utilities wrap simple pointers to help avoid memory leaks.
Tpetra uses the types LocalOrdinal, GlobalOrdinal, size_t, and global_size_t to distinguish
between the kinds of ordinals that can be passed. The basic types in Tpetra are all quite
similar to those in Epetra. Using the typedef keyword, we can hide much of the templating and
make the code easier to read.

Thyra and Stratimikos give users the ability to solve linear systems independently of the
concrete matrix and vector types. We wrap our Tpetra objects in Thyra and can then easily
create the solver from an XML file. Using the solver is simple once it is created. Work will
continue on Tpetra and Stratimikos to give users even more flexibility in matrix types and
solver options.

TESTING ENGINEERING SOFTWARE: DEVELOPMENT AND TESTING OF
THE CUBIT PROGRAM

STUDENT BRYAN J. HARDY* AND MENTOR BRETT W. CLARK†

Abstract.  Testing  is  essential  to  the  software  development  process.  As  active  development  is  extended  in  a 
program  and  increasing  amounts  of  code  and  algorithms  combine,  the  probability  that  something  may  be  missed 
also  increases.  Thus  regular  testing,  both  automated  and  human,  must  be  done.  This  testing  ensures  that  new 
changes  in  one  area  are  accounted  for  and  work  properly,  and  don’t  negatively  affect  code  in  another  area.  The 
Python  interface  is  a  newer  portion  of  Cubit  that  was  found  in  need  of  testing.  This  work  found  a  lot  of  capability 
that  could  be  employed,  and  a  lot  of  documentation  adjustments  needed  to  make  that  capability  useful.  This  paper 
discusses  the  testing  of  both  the  outward  user  commands  entered  into  Cubit  and  some  of  the  internal  data  retrieval 
and  modification  functions  accessed  by  Python.  It  looks  at  the  effort  to  ensure  Cubit  functionality,  make  the  Python 
functionality  more  apparent,  and  automate  a  regular  testing  of  the  Python  interface.  Increasing  the  scope  of  testing 
yields  more  complete  code  coverage  and  increased  utility  for  the  end  user.  Current  and  future  user  metrics  are  also 
discussed. 


1. Introduction. Cubit is a finite element preprocessor for automated mesh generation.
It has many features and algorithms for producing 2D and 3D meshes with both
triangular/tetrahedral elements and quadrilateral/hexahedral elements. It “is designed to
provide the user with a powerful toolkit of meshing algorithms that require varying degrees of
input to produce a complete finite element model ... The overall goal of the CUBIT project is
to reduce the time it takes a person to generate an analysis model.” [2] Cubit is used to
generate finite element meshes for geometric models that either came from a CAD modeler or
were generated within Cubit itself. Boundary conditions can also be defined and exported to
fully prepare the model for analysis outside of Cubit.

Cubit is a command-driven program. A GUI has also been implemented in the code to
increase functionality, but most of the GUI functions simply relay commands to the
command line. Many different commands and functions have been developed to support
Cubit’s goals, and as they are developed, tests are created and executed to ensure that
the desired functionality works as intended. The current version of Cubit is 12.1. Because of
the many years of development and continued work on the Cubit project, there is a lot of code
to be tested and maintained to ensure robust performance.

In this paper, we describe the current testing setup and approaches. We then look
more specifically at the testing of Cubit’s Python interface. Future testing directions
and methods are also discussed.

2.  General  Cubit  Testing.  As  Cubit  has  been  developed,  a  substantial  test  suite  has 
been  created  along  with  it.  The  tests  consist  of  journal  files  (files  of  Cubit  commands)  that 
can  be  run  in  Cubit.  Thus,  by  setting  up  the  tests  appropriately,  certain  parts  of  the  code  can 
be  checked  to  ensure  that  they  are  working  as  desired.  Cubit  reports  errors  that  can  be  used 
to  determine  passage  or  failure  of  a  test.  With  this  test  suite  in  place,  developers  can  access 
and  run  tests  on  their  new  code  to  see  if  it  changes  the  outcome  for  other  parts  of  the  code. 
The  entirety  of  the  test  suite  (which  consists  of  about  185  tests,  each  with  several  cases)  is 
run  each  night,  consistently  checking  the  code  as  it  is  developed  each  day. 

Many tests arise in association with new features and new code, but there are
also many tests that are implemented in response to bugs found in the program. A bug is
found by manual testing, a fix is implemented, and then a test is created to make sure that the
bug does not surface again.

Bugs are found primarily by three means: by developers encountering them as they create
new code, by intentionally testing the code and looking for bugs, or by users while running
Cubit. User reports are extremely useful, as they reflect problems encountered in real-world
applications of Cubit. Unfortunately, a user finding the problem is the least desirable method,
as that indicates the product was not fully prepared before it was released for use. Thus,
efforts are made beforehand to manually test Cubit and ensure stability in the code.

One testing method for finding bugs is a cold-call approach. This entails general
use and abuse of Cubit in whatever way the tester decides. This is not the most effective way
to test, but it is a good idea to employ it occasionally. It can cover a wide range of features
and exercise some of the lesser-used (and thus less-tested) features. However, this method can
also lead to finding ridiculous bugs in the program, which are somewhat less useful. Abuse
in testing is a good check of robustness in the code, but there is a point at which it becomes
overkill and inefficient. Covering every possible case of how someone may use a feature
inappropriately would be a long and not very fruitful process. Thus in testing efforts, one must
separate the ridiculous from the useful, making a note of the ridiculous when found, but
placing emphasis on finding the bugs in the more commonly trod paths. Bugs are not to be
ignored and forgotten, no matter how ridiculous (as they may reveal some other, more important
underlying flaw), but priorities should be established to allocate time and resources to fixing
them.

Another method is a targeted approach. This involves testing one particular area
as thoroughly as possible, trying the full range of commands and cases, and the full range
of inputs and values. The target can be selected either by user feedback (areas in which
users have had problems) or by developer feedback (areas in which developers have tinkered
or developed recently). When the code to be tested is brand new, there are potentially
more unhandled cases and forgotten details to be found and dealt with. When the
code is more established, testing can suggest better functionality as well as reveal
discrepancies between how the functions should work and what actually occurs.

One key element of testing is that when you encounter a bug (either yourself or as reported
by someone else), you need to find a way to reproduce it simply and reliably. A list of
one hundred commands that causes a crash or an error does little to help the effort
of fixing it. Reducing the commands to the simplest possible set that still produces the
bug makes it easier to find the problem within the code. Also, a report of a bug is not of
much use when the bug can’t be reproduced. Often one may find a bug but not be able
to remember how one got there. Then we know that something is wrong, but have no way
to fix it. Isolating the offending code enables fixes to be implemented in a more timely and
efficient manner. Shaving off the time it takes to bulletproof the code leaves more time for
further development and progression.
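
Reducing a long command list to a minimal reproducing set can, in principle, be automated. The following is a minimal sketch of a greedy reduction, not a description of how Cubit's developers do it. The `triggers_bug` predicate, the `fake_bug` stand-in, and the sample command strings are all hypothetical; in practice the predicate would replay the commands in Cubit and report whether the bug still appears.

```python
def minimize_commands(commands, triggers_bug):
    """Greedily remove commands that are not needed to reproduce a bug.

    `triggers_bug` is a caller-supplied (hypothetical) predicate that replays
    a list of commands and returns True if the bug still occurs.
    """
    if not triggers_bug(commands):
        raise ValueError("the full command list does not reproduce the bug")
    reduced = list(commands)
    changed = True
    while changed:
        changed = False
        for i in range(len(reduced)):
            # Try the list with command i left out.
            candidate = reduced[:i] + reduced[i + 1:]
            if triggers_bug(candidate):
                reduced = candidate
                changed = True
                break
    return reduced


# Stand-in predicate for illustration only: pretend the bug appears whenever
# a brick is created and later meshed, regardless of the other commands.
def fake_bug(cmds):
    try:
        created = cmds.index("brick x 10")
    except ValueError:
        return False
    return "mesh volume 1" in cmds[created + 1:]


journal = [
    "reset",
    "brick x 10",
    "volume 1 size 2",
    "draw volume 1",
    "mesh volume 1",
    "list volume 1",
]
minimal = minimize_commands(journal, fake_bug)
print(minimal)
```

Each pass replays the candidate list once per command, so this simple version is only practical when replaying a journal is cheap; it is meant to show the idea rather than an efficient algorithm.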

Since Cubit is driven by commands, the best way to reproduce bugs is by finding the
smallest number of commands that show the result. Having a GUI included in the code
changes things a little, because the program can now send and receive information without
executing commands. In Cubit, there are panels in the GUI that provide a convenient interface
for creating commands and sending them to the command line. There are other parts of the
GUI that retrieve information or perform other functions. A GUI is built to interact with a
human, which increases the difficulty of automating testing. Thus, most GUI testing is manual.
Cubit has many interactive GUI setups and displays, including toolbars, a tree navigator, an
entity property page, and the Graphics window. With data being transferred and displayed
in a variety of ways, there is always room for error that automated testing would not be
able to detect. Testing the GUI could include haphazard button pushing (making sure
the program can handle all the GUI events), but a more controlled look at each interface is
needed. This testing would need to be done with any GUI development, but perhaps not as
often as the automated tests (again, for efficiency).

3. Testing Cubit’s Python Interface. The Python interface within Cubit is a unique
portion of the program. Python is an object-oriented scripting language. Cubit is a
command-driven program, but it has a Python interface that can act as a driver for it. With
Python able to interface with Cubit, we can script various commands and use features such as
loops to create more elegant and complicated command sets. The interface is also used to
expose methods from the underlying C++ code to the user. Its easy customization “provides
a method of delivering code to specific users needing functionality without having to wait for
an entire release cycle. It also provides a method of delivering code to users that is of interest
only to that user and is not of benefit to the general user base.” [1] This makes Cubit much
more versatile for many models, problems, and applications.

This interface had previously been tested only to a limited degree. It is a relatively new
component with extensive capabilities, and when it was implemented, the full range of rigorous
testing was not completed. Its differences also prevent tests for it from being readily added
to the traditional test suite, whose tests consist of Cubit commands, not Python code. A
primary objective of this testing was to set things up so that .py files (uncompiled Python
files) could be run in an automated fashion and report back the result of the test.

The versatility of Python presents a challenge for testing and robustness. The more
you can do, the more potential there is for problems to go unaccounted for. Thus some
testing needs to be done to see whether basic scripting ability is present (loops, defined
methods, data structures, etc.) and to what extent within the program. Cubit intends to have
the full capability of a scripting language through Python. Methods can be defined that help
automate processes within Cubit. Combining them with loops could automate even more, or such
methods could be made available for the user to adjust and employ as needed. With the
various capabilities referred to, “we have found the scripting interface to [be] very useful” [1]
for providing the unique functionality it intends to expose.
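
As a small illustration of combining a defined method with a loop, the sketch below builds one command string per input value and hands each to a sender. The function name `build_bricks` and the `send` parameter are hypothetical; `send` stands in for whatever callable passes a command string to Cubit, and a plain list is used here so the example runs without Cubit.

```python
def build_bricks(sizes, send):
    """Issue one 'brick' command per size and return the commands sent.

    `send` is any callable that accepts a command string (an assumption made
    for this sketch; it is not taken from the Cubit documentation).
    """
    issued = []
    for size in sizes:
        command = "brick x {0}".format(size)
        send(command)
        issued.append(command)
    return issued


# Record the commands in a list instead of sending them to a real session.
log = []
commands = build_bricks([1, 2, 4], log.append)
print(commands)
```

Keeping the sender as a parameter means the same method can be exercised in an automated test with a recording stub and used interactively with a real command channel.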

Python scripting capability and the Python interface must be tested within Cubit to ensure
usability. The generic functionality testing had been done previously, with only minor issues
found. Python is fully functional within Cubit, and Cubit is fully functional as a module
imported into Python. With Python appropriately set up to work with Cubit, the methods
defined for the Cubit module must also have their functionality verified. These methods are
defined in C++ and work within the Cubit GUI, but have been exposed to the user and Python
via SWIG. The majority of these methods are simply data retrieval methods. That is, they
return information but do not change the state of Cubit. Some methods do change the state of
Cubit, either by changing settings, creating geometry, or passing in and executing Cubit
commands.

The first approach needed was manual testing. We primarily needed to know what calls
were available for use in Python, what they did (or were supposed to do), what parameters
were required, and what would be returned. Documentation for this was generated from
comments before each function, so we had only as much information on their workings as
there was in the code comments. The comments were written by developers who knew the
code and consequently did not need much explanation to figure out how to use a method.
Users, however, do not have such a knowledge base, and the initial state of the documentation
for using Python in Cubit was sorely lacking in description and clarity.

Most of the functions are named specifically enough that their purpose is fairly
straightforward. Many have sufficient descriptions to specify further what they do. Many of
the functions, however, did not explain what they were doing or what they were returning to
the user, or did not say anything at all. All of the functions indicated the names of the
parameters required and sometimes explained them, but none of them stated exactly what type
of parameter was expected! Is the user supposed to enter an int? A string? A unique data
structure? An enum? For many functions, it was easy enough to determine, but several were
fairly ambiguous and non-conforming. This made the functions essentially useless except with a
large amount of trial and error until you figured out exactly what the function was asking for
and got your input correct for it. Hence although the Python interface was potentially very
useful, it was not very usable. Clearing up how to use the various functions available
improved its utility.

Fig. 3.1. An example Python script for Cubit

Testing  began  by  going  through  all  of  the  documented  functions  for  Python  in  the  Cubit 
module.  It  was  soon  discovered  that  several  of  the  functions  were  not  present  in  the  module. 
Upon  examining  the  code,  it  was  found  that  these  functions  existed,  but  were  not  wrapped  for 
use  by  Python.  Due  to  how  Python  handles  references  and  pointers,  it  made  sense  not  to  have 
these  available  for  Python.  Several  more  functions  were  found  that  likewise  did  not  play  well 
with  Python,  and  so  needed  to  be  hidden  from  it. 

A first pass of testing determined whether the given functions were actually available to
the user. Next, each function was examined to determine what parameters were needed and
what information was returned to the user. This was used to clarify inputs and outputs in
the documentation. Then, each function was examined to see whether or not it performed as
advertised. Often, the word and the deed matched. In some cases, the function performed
correctly, but the documentation was inaccurate. In other cases, it did not do what it was
intended to do. Thus in testing, care had to be taken to ensure accuracy, because a function
that returned some value or object without producing an error could still be wrong or broken.

Hence, initial functionality and existence were established for the Python interface. At
that point, there was a base on which future testing and development could be built.
Use cases for tests will be found by users, who will now be better able to utilize these
functions in their projects. Covering every use case for each function would be an enormous
undertaking, and may even be redundant for some commands whose inner workings are tested
by Cubit’s test suite. Thus it is not practical to begin by creating an extensive test suite
for all capabilities. Rather, the approach should be incremental: as user problems are
reported, tests can be developed to ensure the bugs are fixed and the desired functionality is
in place.

After manual testing was completed, the comments in the code and the help documentation
were updated to reflect more accurately the capabilities that could be employed from Python.
The remaining work was to create a way to automate some of this testing to ensure
that the functionality would be maintained through the various changes and developments that
occur. The traditional Cubit test suite uses Cubit errors and exits to determine whether a test
was successful. The Python tests were designed such that errors would be reported both for
Python code errors and when a function is broken and returns the wrong value. This setup
reflects the many ways in which bugs may manifest themselves, and accounts for them.
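
The two failure modes described above (a Python error, or a call that succeeds but returns the wrong value) can be checked by one small harness. The sketch below is a generic illustration under stated assumptions, not the actual Cubit test code: `run_checks` and the three stand-in functions are hypothetical and take the place of wrapped data-retrieval methods.

```python
import traceback


def run_checks(checks):
    """Run (name, function, expected) triples and collect failure messages.

    A check fails if the call raises any exception or if it returns a value
    different from `expected`.
    """
    failures = []
    for name, func, expected in checks:
        try:
            actual = func()
        except Exception:
            # Keep only the final line of the traceback, e.g. "TypeError: ...".
            last_line = traceback.format_exc().strip().splitlines()[-1]
            failures.append("{0}: raised {1}".format(name, last_line))
            continue
        if actual != expected:
            failures.append(
                "{0}: expected {1!r}, got {2!r}".format(name, expected, actual)
            )
    return failures


# Hypothetical stand-ins for wrapped data-retrieval methods.
def get_volume_count():
    return 3


def get_broken_count():
    return -1  # runs without error but returns the wrong value


def get_missing():
    raise AttributeError("function is not wrapped for Python")


failures = run_checks([
    ("volume count", get_volume_count, 3),
    ("broken count", get_broken_count, 2),
    ("missing", get_missing, 0),
])
for message in failures:
    print("ERROR:", message)
```

Printing a line that starts with an error marker is one way such a script could signal failure to an outer test runner that scans output for errors; how the real suite reports results is not specified here.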

The Python tests could have a very broad scope, as indicated earlier. Functions were grouped
into tests in various ways, but generally with other functions that either provided similar
information or dealt with similar objects in Cubit. For example, there is a test dealing with
all the information retrieval commands related to non-exodus boundary conditions, and there is
a test dealing with any commands that count the number of a particular element in the model.
These tests are created only to indicate basic functionality of the methods and to test for
basic bugs. Much of the stress testing of code robustness is covered by the general test suite.
To reiterate the point, the full extent of what could be tested through the Python interface
is enormous, so these tests are designed to ensure that the functionality continues to exist
from Python, not to probe the extreme limits of that functionality.

4. User Metrics. In order to make further relevant improvements to the code, it is
important to have information about how the program is being used and how it is not being
used. User metrics will give developers keener insight into how best to meet the needs
of users and which features or projects should have more time and money spent on them.

The first step was to record use and find out how many people regularly use Cubit, as
seen in Fig. 4.1. This is a good general barometer of the current success of the program. The
next indicator implemented was to record when Cubit exits abnormally. Many crashes
in the program are difficult to reproduce because the chain of events that produced each one
is not always captured by the commands issued. GUI usage is what throws things off, and the
combination of commands and clicks is hard for the user to remember. Often, the user does
something to get Cubit into an unstable state, and then some later action causes the crash.
Developers are unable to reproduce the crash, and thus unable to fix it. Currently, the program
reports when it crashed and an exit code, and the rest of the desired information is obtained
from the user. This is general data showing what types of situations typically cause a crash,
so that we have a better idea of how to approach a solution. This effort will help improve the
robustness of the program, though it may or may not improve functionality.

Moving forward, we are looking at what other metrics to gather and how to gather them.
Cubit is a very large and complex program, so there naturally are several ways to do
this. One way to gather the data is simply to record the commands issued. Cubit is a
command-driven program, so this would report nearly everything a user does. That is also its
drawback. It would return a large amount of information; reporting it could slow down
execution, and the data would then require filtering to retrieve the useful, or most useful,
parts. Another problem is that such reporting could act as spyware by retrieving
proprietary information. Thus, the code would need to strip out all of the numbers and names
from the parameters.
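
To make the idea of stripping numbers and names concrete, here is a minimal sketch that replaces quoted strings and numeric literals in a command with placeholders. It assumes, for illustration only, that names appear as quoted strings; the function name `scrub`, the placeholder tokens, and the sample commands are hypothetical, and a real filter would need to follow Cubit's actual command grammar.

```python
import re

# Quoted strings (single or double quotes) are treated as names.
_QUOTED = re.compile(r"""(['"]).*?\1""")
# Standalone integers, decimals, and exponent forms such as 10, 2.5, 1e-3.
_NUMBER = re.compile(
    r"(?<![\w.])[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?(?![\w.])"
)


def scrub(command):
    """Replace quoted names and numeric literals with placeholders."""
    command = _QUOTED.sub("<name>", command)
    return _NUMBER.sub("<num>", command)


print(scrub("brick x 10 y 2.5 z 1e-3"))
print(scrub('volume 1 name "housing_A7"'))
```

The command keywords survive, so usage counts per command type remain available while the model-specific values are dropped before anything is reported.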

Another  way  to  gather  metrics  is  to  use  GUI  events.  This  would  limit  our  data  sampling 
to  just  GUI  version  users,  and  then  just  the  actual  functions  performed  from  the  GUI.  This 
would  be  valuable  information  for  GUI  development  and  usage,  but  would  lack  in  many 
respects  the  breadth  of  coverage  of  functions  (unless  users  do  every  single  thing  from  the 
GUI)  and  depth  in  number  of  users  (unless  most  users  use  the  GUI  version,  which  also  would 
be  valuable  to  know). 

A  third  way  still  is  to  place  several  trigger  spots  in  the  code  that  would  report  when  that 
portion  of  the  code  was  used.  These  could  correspond  to  common  classes  that  are  used  by 
commands,  or  spots  in  the  code  that  a  set  of  commands  pass  through.  A  few  different  triggers 
for  various  operations  could  be  targeted  and  implemented  to  focus  and  prioritize  the  metrics 
to  analyze. 
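
A minimal sketch of such trigger spots, written in Python for brevity, is given below. The decorator name trigger, the counter usage_counts, and the two example functions are hypothetical and stand in for whatever common classes or code paths would actually be instrumented in Cubit.

```python
from collections import Counter
from functools import wraps

# Hypothetical in-memory store; a real tool would report these counts somewhere.
usage_counts = Counter()


def trigger(name):
    """Mark a function as a trigger spot that records each use under `name`."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            usage_counts[name] += 1  # record that this code path was reached
            return func(*args, **kwargs)
        return wrapper
    return decorate


@trigger("meshing")
def mesh_volume(volume_id):
    # Placeholder body standing in for a real meshing operation.
    return f"meshed volume {volume_id}"


@trigger("export")
def export_mesh(filename):
    # Placeholder body standing in for a real export operation.
    return f"exported {filename}"
```

Several commands can share one trigger name, so a few well-placed triggers can summarize broad categories of use without recording any command parameters.
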

Fig. 4.1. Cubit repeat users


The first approach to gathering metrics would probably be the most useful in the end, as it
would provide the most information. It could also be used to check the command
coverage of the test suite. However, the performance problem and the implementation effort
required suggest that this may not be the best approach for now. What seems to be the best
approach for the time being is a combination of some of the GUI events from approach 2 and
the triggers from approach 3. That way, we can tailor the code to report the most pertinent
data for near-term development and long-term planning.

When  we  analyze  these  metrics,  care  must  be  taken.  If  we  see  a  feature  is  not  being 
used,  it  could  be  that  it  just  isn’t  as  useful  a  feature,  and  so  less  effort  should  be  placed  on 
it.  But  it  could  also  be  the  case  that  the  feature  isn’t  used  because  it  is  not  well  publicized 
or  understood,  and  so  more  effort  should  be  placed  on  it.  Obtaining  these  metrics  will  assist 
developers  in  knowing  what  questions  to  ask,  and  help  them  determine  where  to  focus  their 
efforts  to  help  the  users  more  and  improve  the  code. 

5. Conclusions. As long as development is ongoing, testing must be ongoing as well. Whenever
code is continually altered, there are chances for unexpected outcomes and mistakes,
especially in very large and complex codes. Thus, testing must be maintained to ensure a robust
and satisfactory program. Employing a variety of testing techniques helps cover many aspects
of the program. The work on testing the Python interface in Cubit illustrates this: because a lot of
functionality must be covered, it needs to be tested in different ways. The process was
automated to ensure essential functionality without repeated manual testing. Obtaining user
metrics will help determine where further development and testing efforts should be directed.